runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)

public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
       [not found]     ` <201502112128.44852@pali>
@ 2015-02-11 20:40       ` Nishanth Menon
  2015-02-18 21:14         ` Pali Rohár
  2015-05-28  7:37         ` Pali Rohár
  0 siblings, 2 replies; 20+ messages in thread
From: Nishanth Menon @ 2015-02-11 20:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 11, 2015 at 2:28 PM, Pali Roh?r <pali.rohar@gmail.com> wrote:
> On Wednesday 11 February 2015 16:22:51 Matthijs van Duin wrote:
>> On 11 February 2015 at 13:39, Pali Roh?r <pali.rohar@gmail.com>
> wrote:
>> >> Anyhow, since checking the firewalls/APs to see if you have
>> >> permission will probably only get you yet another fault if
>> >> things are walled off, the robust way of dealing with this
>> >> sort of situation is by probing the device with a read
>> >> while trapping bus faults. This also handles modules that
>> >> are unreachable for other reasons, e.g. being disabled by
>> >> eFuse.
>> >
>> > It is possible to patch kernel code to mask or ignore that
>> > fault? Can you help me with something like that?
>>
>> As I mentioned, I'm still learning my way around the kernel,
>> so I don't feel very comfortable suggesting a concrete patch
>> just yet. I've been browsing arch/arm/mm/ however and my
>> impression is that all that would be required is editing
>> fault.c by making a copy of do_bad but containing
>>     return user_mode(regs) || !fixup_exception(regs);
>> and hook it onto the appropriate fault codes.  However, this
>> really needs the opinion of someone more familiar with this
>> code.
>>
>> I do have an observation to make on the issue of fault
>> decoding: the list in fsr-2level.c may be "standard ARMv3 and
>> ARMv4 aborts" but they are quite wrong for ARMv7 which has:
>>
>> [ 0] -
>> [ 1] alignment fault
>> [ 2] debug event
>> [ 3] section access flag fault
>> [ 4] instruction cache maintainance fault (reported via data
>> abort) [ 5] section translation fault
>> [ 6] page access flag fault
>> [ 7] page translation fault
>> [ 8] bus error on access
>> [ 9] section domain fault
>> [10] -
>> [11] page domain fault
>> [12] bus error on section table walk
>> [13] section permission fault
>> [14] bus error on page table walk
>> [15] page permission fault
>> [16] (TLB conflict abort)
>> [17] -
>> [18] -
>> [19] -
>> [20] (lockdown abort)
>> [21] -
>> [22] async bus error (reported via data abort)
>> [23] -
>> [24] async parity/ECC error (reported via data abort)
>> [25] parity/ECC error on access
>> [26] (coprocessor abort)
>> [27] -
>> [28] parity/ECC error on section table walk
>> [29] -
>> [30] parity/ECC error on page table walk
>> [31] -
>>
>> Some entries are patched up near the bottom of fault.c but
>> many bogus messages remain, for example the "on linefetch" vs
>> "on non-linefetch" is misleading since no such thing can be
>> inferred from the fault status on v7.  Also, the i-cache
>> maintenance fault handling looks wrong to me: it should fetch
>> the actual fault status from IFSR (even though the address
>> still comes from DFSR) and dispatch based on that.
>>
>> Async external aborts (async bus error and async parity/ECC
>> error) give you basically no info. DFAR will contain garbage
>> hence displaying it will confuse rather than enlighten, a
>> traceback is pointless since the instruction that caused the
>> access is long retired, likewise user_mode() doesn't matter
>> since a transition to kernel space may have happened after
>> the access that cause the abort. Basically they should be
>> treated more as an IRQ than as a fault (note they can also be
>> masked just like irqs). In case of a bus error, it may be
>> appropriate to just warn about it, or perhaps send a signal
>> to the current process, although in the latter case it should
>> have some means to distinguish it from a synchronous bus
>> error.
>>
>> At least on the cortex-a8, a parity/ECC error (whether async
>> or not) is to be regarded as absolutely fatal.  Quoth the
>> TRM: "No recovery is possible. The abort handler must disable
>> the caches, communicate the fail directly with the external
>> system, request a reboot."
>>
>> Bit 10 no longer indicates an asynchronous (let alone
>> imprecise) fault.  Apart from the debug events and async
>> aborts (and possibly some implementation-defined aborts), all
>> aborts listed are synchronous, and DFAR/IFAR is valid.
>> There's no technical obstruction to make these trappable via
>> the kernel exception handling mechanism. (Though at least in
>> case of parity/ECC errors one shouldn't.)
>
> Tony, Nishanth, or somebody else... can you help with memory
> management? Or do you know some expert for arch/arm/mm/ code?

Folks in linux-arm-kernel are probably the right people, I suppose.
Looping them in.

-- 
---
Regards,
Nishanth Menon

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-02-11 20:40       ` runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot) Nishanth Menon
@ 2015-02-18 21:14         ` Pali Rohár
  2015-05-28  7:37         ` Pali Rohár
  1 sibling, 0 replies; 20+ messages in thread
From: Pali Rohár @ 2015-02-18 21:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 11 February 2015 21:40:33 Nishanth Menon wrote:
> On Wed, Feb 11, 2015 at 2:28 PM, Pali Roh?r 
<pali.rohar@gmail.com> wrote:
> > On Wednesday 11 February 2015 16:22:51 Matthijs van Duin 
wrote:
> >> On 11 February 2015 at 13:39, Pali Roh?r
> >> <pali.rohar@gmail.com>
> > 
> > wrote:
> >> >> Anyhow, since checking the firewalls/APs to see if you
> >> >> have permission will probably only get you yet another
> >> >> fault if things are walled off, the robust way of
> >> >> dealing with this sort of situation is by probing the
> >> >> device with a read while trapping bus faults. This also
> >> >> handles modules that are unreachable for other reasons,
> >> >> e.g. being disabled by eFuse.
> >> > 
> >> > It is possible to patch kernel code to mask or ignore
> >> > that fault? Can you help me with something like that?
> >> 
> >> As I mentioned, I'm still learning my way around the
> >> kernel, so I don't feel very comfortable suggesting a
> >> concrete patch just yet. I've been browsing arch/arm/mm/
> >> however and my impression is that all that would be
> >> required is editing fault.c by making a copy of do_bad but
> >> containing
> >> 
> >>     return user_mode(regs) || !fixup_exception(regs);
> >> 
> >> and hook it onto the appropriate fault codes.  However,
> >> this really needs the opinion of someone more familiar
> >> with this code.
> >> 
> >> I do have an observation to make on the issue of fault
> >> decoding: the list in fsr-2level.c may be "standard ARMv3
> >> and ARMv4 aborts" but they are quite wrong for ARMv7 which
> >> has:
> >> 
> >> [ 0] -
> >> [ 1] alignment fault
> >> [ 2] debug event
> >> [ 3] section access flag fault
> >> [ 4] instruction cache maintainance fault (reported via
> >> data abort) [ 5] section translation fault
> >> [ 6] page access flag fault
> >> [ 7] page translation fault
> >> [ 8] bus error on access
> >> [ 9] section domain fault
> >> [10] -
> >> [11] page domain fault
> >> [12] bus error on section table walk
> >> [13] section permission fault
> >> [14] bus error on page table walk
> >> [15] page permission fault
> >> [16] (TLB conflict abort)
> >> [17] -
> >> [18] -
> >> [19] -
> >> [20] (lockdown abort)
> >> [21] -
> >> [22] async bus error (reported via data abort)
> >> [23] -
> >> [24] async parity/ECC error (reported via data abort)
> >> [25] parity/ECC error on access
> >> [26] (coprocessor abort)
> >> [27] -
> >> [28] parity/ECC error on section table walk
> >> [29] -
> >> [30] parity/ECC error on page table walk
> >> [31] -
> >> 
> >> Some entries are patched up near the bottom of fault.c but
> >> many bogus messages remain, for example the "on linefetch"
> >> vs "on non-linefetch" is misleading since no such thing
> >> can be inferred from the fault status on v7.  Also, the
> >> i-cache maintenance fault handling looks wrong to me: it
> >> should fetch the actual fault status from IFSR (even
> >> though the address still comes from DFSR) and dispatch
> >> based on that.
> >> 
> >> Async external aborts (async bus error and async parity/ECC
> >> error) give you basically no info. DFAR will contain
> >> garbage hence displaying it will confuse rather than
> >> enlighten, a traceback is pointless since the instruction
> >> that caused the access is long retired, likewise
> >> user_mode() doesn't matter since a transition to kernel
> >> space may have happened after the access that cause the
> >> abort. Basically they should be treated more as an IRQ
> >> than as a fault (note they can also be masked just like
> >> irqs). In case of a bus error, it may be appropriate to
> >> just warn about it, or perhaps send a signal to the
> >> current process, although in the latter case it should
> >> have some means to distinguish it from a synchronous bus
> >> error.
> >> 
> >> At least on the cortex-a8, a parity/ECC error (whether
> >> async or not) is to be regarded as absolutely fatal. 
> >> Quoth the TRM: "No recovery is possible. The abort handler
> >> must disable the caches, communicate the fail directly
> >> with the external system, request a reboot."
> >> 
> >> Bit 10 no longer indicates an asynchronous (let alone
> >> imprecise) fault.  Apart from the debug events and async
> >> aborts (and possibly some implementation-defined aborts),
> >> all aborts listed are synchronous, and DFAR/IFAR is valid.
> >> There's no technical obstruction to make these trappable
> >> via the kernel exception handling mechanism. (Though at
> >> least in case of parity/ECC errors one shouldn't.)
> > 
> > Tony, Nishanth, or somebody else... can you help with memory
> > management? Or do you know some expert for arch/arm/mm/
> > code?
> 
> Folks in linux-arm-kernel are probably the right people, I
> suppose. Looping them in.

Hi folks in linux-arm-kernel!

Can you help us with above problem? How to catch external abort 
on non-linefetch in kernel driver and prevent kernel panic?

Here is that kernel panic log: 
http://thread.gmane.org/gmane.linux.ports.arm.omap/108397/

We want to check for "Unhandled fault: external abort on non-
linefetch" and if it happens disable some functionality in kernel 
driver omap-aes.ko

-- 
Pali Roh?r
pali.rohar at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20150218/c3e32be8/attachment-0001.sig>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
       [not found]   ` <CAALWOA_ngoSKjB=ZQ264Va37bBK7v41Ei45SyoYLiMdanTKnxQ@mail.gmail.com>
       [not found]     ` <201502112128.44852@pali>
@ 2015-02-19 18:20     ` Pali Rohár
  2015-02-19 20:25       ` Matthijs van Duin
  2015-02-19 21:10       ` Aaro Koskinen
  1 sibling, 2 replies; 20+ messages in thread
From: Pali Rohár @ 2015-02-19 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 11 February 2015 16:22:51 Matthijs van Duin wrote:
> On 11 February 2015 at 13:39, Pali Roh?r <pali.rohar@gmail.com> wrote:
> >> Anyhow, since checking the firewalls/APs to see if you have
> >> permission will probably only get you yet another fault if
> >> things are walled off, the robust way of dealing with this
> >> sort of situation is by probing the device with a read
> >> while trapping bus faults. This also handles modules that
> >> are unreachable for other reasons, e.g. being disabled by
> >> eFuse.
> > 
> > It is possible to patch kernel code to mask or ignore that
> > fault? Can you help me with something like that?
> 
> As I mentioned, I'm still learning my way around the kernel,
> so I don't feel very comfortable suggesting a concrete patch
> just yet. I've been browsing arch/arm/mm/ however and my
> impression is that all that would be required is editing
> fault.c by making a copy of do_bad but containing
>     return user_mode(regs) || !fixup_exception(regs);
> and hook it onto the appropriate fault codes.  However, this
> really needs the opinion of someone more familiar with this
> code.
> 
> I do have an observation to make on the issue of fault
> decoding: the list in fsr-2level.c may be "standard ARMv3 and
> ARMv4 aborts" but they are quite wrong for ARMv7 which has:
> 
> [ 0] -
> [ 1] alignment fault
> [ 2] debug event
> [ 3] section access flag fault
> [ 4] instruction cache maintainance fault (reported via data
> abort) [ 5] section translation fault
> [ 6] page access flag fault
> [ 7] page translation fault
> [ 8] bus error on access
> [ 9] section domain fault
> [10] -
> [11] page domain fault
> [12] bus error on section table walk
> [13] section permission fault
> [14] bus error on page table walk
> [15] page permission fault
> [16] (TLB conflict abort)
> [17] -
> [18] -
> [19] -
> [20] (lockdown abort)
> [21] -
> [22] async bus error (reported via data abort)
> [23] -
> [24] async parity/ECC error (reported via data abort)
> [25] parity/ECC error on access
> [26] (coprocessor abort)
> [27] -
> [28] parity/ECC error on section table walk
> [29] -
> [30] parity/ECC error on page table walk
> [31] -
> 
> Some entries are patched up near the bottom of fault.c but
> many bogus messages remain, for example the "on linefetch" vs
> "on non-linefetch" is misleading since no such thing can be
> inferred from the fault status on v7.  Also, the i-cache
> maintenance fault handling looks wrong to me: it should fetch
> the actual fault status from IFSR (even though the address
> still comes from DFSR) and dispatch based on that.
> 
> Async external aborts (async bus error and async parity/ECC
> error) give you basically no info. DFAR will contain garbage
> hence displaying it will confuse rather than enlighten, a
> traceback is pointless since the instruction that caused the
> access is long retired, likewise user_mode() doesn't matter
> since a transition to kernel space may have happened after
> the access that cause the abort. Basically they should be
> treated more as an IRQ than as a fault (note they can also be
> masked just like irqs). In case of a bus error, it may be
> appropriate to just warn about it, or perhaps send a signal
> to the current process, although in the latter case it should
> have some means to distinguish it from a synchronous bus
> error.
> 
> At least on the cortex-a8, a parity/ECC error (whether async
> or not) is to be regarded as absolutely fatal.  Quoth the
> TRM: "No recovery is possible. The abort handler must disable
> the caches, communicate the fail directly with the external
> system, request a reboot."
> 
> Bit 10 no longer indicates an asynchronous (let alone
> imprecise) fault.  Apart from the debug events and async
> aborts (and possibly some implementation-defined aborts), all
> aborts listed are synchronous, and DFAR/IFAR is valid.
> There's no technical obstruction to make these trappable via
> the kernel exception handling mechanism. (Though at least in
> case of parity/ECC errors one shouldn't.)

Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch:

diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c..d530d55 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -7,7 +7,12 @@ static struct fsr_info fsr_info[] = {
 	{ do_bad,		SIGBUS,	 BUS_ADRALN,	"alignment exception"		   },
 	{ do_bad,		SIGKILL, 0,		"terminal exception"		   },
 	{ do_bad,		SIGBUS,	 BUS_ADRALN,	"alignment exception"		   },
+/* Do we need runtime check ? */
+#if __LINUX_ARM_ARCH__ < 6
 	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
+#else
+	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"I-cache maintenance fault"	   },
+#endif
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"section translation fault"	   },
 	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
 	{ do_page_fault,	SIGSEGV, SEGV_MAPERR,	"page translation fault"	   },

Maybe it is related?

-- 
Pali Roh?r
pali.rohar@gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20150219/dcb50f41/attachment.sig>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-02-19 18:20     ` Pali Rohár
@ 2015-02-19 20:25       ` Matthijs van Duin
  2015-02-19 21:10       ` Aaro Koskinen
  1 sibling, 0 replies; 20+ messages in thread
From: Matthijs van Duin @ 2015-02-19 20:25 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 February 2015 at 22:14, Pali Roh?r <pali.rohar@gmail.com> wrote:
> Can you help us with above problem? How to catch external abort
> on non-linefetch in kernel driver and prevent kernel panic?

Actually it's a synchronous bus error that you want to catch, which
however is misreported by linux as "external abort on non-linefetch"
(... but a bus error on a linefetch would produce exactly the same
error).  Also, ARM apparently uses the term "external abort" as
umbrella term for aborts triggered outside the MMU, which includes not
just bus errors but also (uncorrectable) parity/ECC errors.

Anyhow, the core question you mean to ask is: can the "exception"
mechanism current already in place to trap MMU faults in e.g.
put_user() easily be extended to allow drivers to trap synchronous bus
errors?  My impression is that this would in fact be quite easy and I
even outlined a suggested patch, but I'm still a kernel newbie so I
may be way off course.

Although its main use would be for auto-probing, it's maybe worth
mentioning I've met at least one peripheral which also reports bus
errors when writing inappropriate/unsupported *values* to a register.
(Of course when using posted writes you won't get an abort anyhow in
that case, it's only reported via interconnect error logs.)

On 19 February 2015 at 19:20, Pali Roh?r <pali.rohar@gmail.com> wrote:
> Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch

In mainline linux the same fix-up is done at runtime rather than
compile time (in exceptions_init() at the bottom of fault.c). Either
way, in my post of the 11th I also mentioned that it looks wrong to
me. I-cache maintenance fault is really a special case in the fault
decoding logic since it means "although you got here via DAbort and
the relevant address is in DFAR, the exception happened on the
instruction side so you need to fetch the fault status from IFSR
instead."

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-02-19 18:20     ` Pali Rohár
  2015-02-19 20:25       ` Matthijs van Duin
@ 2015-02-19 21:10       ` Aaro Koskinen
  1 sibling, 0 replies; 20+ messages in thread
From: Aaro Koskinen @ 2015-02-19 21:10 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Feb 19, 2015 at 07:20:41PM +0100, Pali Roh?r wrote:
> Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch:

> +/* Do we need runtime check ? */
> +#if __LINUX_ARM_ARCH__ < 6
>  	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
> +#else
> +	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"I-cache maintenance fault"	   },
> +#endif

> Maybe it is related?

That was unrelated. Also, the patch is also in mainline,
see 8c0b742ca7a7d21de0ddc87eda6ef0b282e4de18 (ARM: 6134/1: Handle
instruction cache maintenance fault properly).

A.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-02-11 20:40       ` runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot) Nishanth Menon
  2015-02-18 21:14         ` Pali Rohár
@ 2015-05-28  7:37         ` Pali Rohár
  2015-05-28 16:01           ` Tony Lindgren
  1 sibling, 1 reply; 20+ messages in thread
From: Pali Rohár @ 2015-05-28  7:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 11 February 2015 14:40:33 Nishanth Menon wrote:
> On Wed, Feb 11, 2015 at 2:28 PM, Pali Roh?r <pali.rohar@gmail.com> wrote:
> > On Wednesday 11 February 2015 16:22:51 Matthijs van Duin wrote:
> >> On 11 February 2015 at 13:39, Pali Roh?r <pali.rohar@gmail.com>
> > wrote:
> >> >> Anyhow, since checking the firewalls/APs to see if you have
> >> >> permission will probably only get you yet another fault if
> >> >> things are walled off, the robust way of dealing with this
> >> >> sort of situation is by probing the device with a read
> >> >> while trapping bus faults. This also handles modules that
> >> >> are unreachable for other reasons, e.g. being disabled by
> >> >> eFuse.
> >> >
> >> > It is possible to patch kernel code to mask or ignore that
> >> > fault? Can you help me with something like that?
> >>
> >> As I mentioned, I'm still learning my way around the kernel,
> >> so I don't feel very comfortable suggesting a concrete patch
> >> just yet. I've been browsing arch/arm/mm/ however and my
> >> impression is that all that would be required is editing
> >> fault.c by making a copy of do_bad but containing
> >>     return user_mode(regs) || !fixup_exception(regs);
> >> and hook it onto the appropriate fault codes.  However, this
> >> really needs the opinion of someone more familiar with this
> >> code.
> >>
> >> I do have an observation to make on the issue of fault
> >> decoding: the list in fsr-2level.c may be "standard ARMv3 and
> >> ARMv4 aborts" but they are quite wrong for ARMv7 which has:
> >>
> >> [ 0] -
> >> [ 1] alignment fault
> >> [ 2] debug event
> >> [ 3] section access flag fault
> >> [ 4] instruction cache maintainance fault (reported via data
> >> abort) [ 5] section translation fault
> >> [ 6] page access flag fault
> >> [ 7] page translation fault
> >> [ 8] bus error on access
> >> [ 9] section domain fault
> >> [10] -
> >> [11] page domain fault
> >> [12] bus error on section table walk
> >> [13] section permission fault
> >> [14] bus error on page table walk
> >> [15] page permission fault
> >> [16] (TLB conflict abort)
> >> [17] -
> >> [18] -
> >> [19] -
> >> [20] (lockdown abort)
> >> [21] -
> >> [22] async bus error (reported via data abort)
> >> [23] -
> >> [24] async parity/ECC error (reported via data abort)
> >> [25] parity/ECC error on access
> >> [26] (coprocessor abort)
> >> [27] -
> >> [28] parity/ECC error on section table walk
> >> [29] -
> >> [30] parity/ECC error on page table walk
> >> [31] -
> >>
> >> Some entries are patched up near the bottom of fault.c but
> >> many bogus messages remain, for example the "on linefetch" vs
> >> "on non-linefetch" is misleading since no such thing can be
> >> inferred from the fault status on v7.  Also, the i-cache
> >> maintenance fault handling looks wrong to me: it should fetch
> >> the actual fault status from IFSR (even though the address
> >> still comes from DFSR) and dispatch based on that.
> >>
> >> Async external aborts (async bus error and async parity/ECC
> >> error) give you basically no info. DFAR will contain garbage
> >> hence displaying it will confuse rather than enlighten, a
> >> traceback is pointless since the instruction that caused the
> >> access is long retired, likewise user_mode() doesn't matter
> >> since a transition to kernel space may have happened after
> >> the access that cause the abort. Basically they should be
> >> treated more as an IRQ than as a fault (note they can also be
> >> masked just like irqs). In case of a bus error, it may be
> >> appropriate to just warn about it, or perhaps send a signal
> >> to the current process, although in the latter case it should
> >> have some means to distinguish it from a synchronous bus
> >> error.
> >>
> >> At least on the cortex-a8, a parity/ECC error (whether async
> >> or not) is to be regarded as absolutely fatal.  Quoth the
> >> TRM: "No recovery is possible. The abort handler must disable
> >> the caches, communicate the fail directly with the external
> >> system, request a reboot."
> >>
> >> Bit 10 no longer indicates an asynchronous (let alone
> >> imprecise) fault.  Apart from the debug events and async
> >> aborts (and possibly some implementation-defined aborts), all
> >> aborts listed are synchronous, and DFAR/IFAR is valid.
> >> There's no technical obstruction to make these trappable via
> >> the kernel exception handling mechanism. (Though at least in
> >> case of parity/ECC errors one shouldn't.)
> >
> > Tony, Nishanth, or somebody else... can you help with memory
> > management? Or do you know some expert for arch/arm/mm/ code?
> 
> Folks in linux-arm-kernel are probably the right people, I suppose.
> Looping them in.
> 

So pinging linux-arm-kernel again. Any idea how to handle that fault?

-- 
Pali Roh?r
pali.rohar at gmail.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28  7:37         ` Pali Rohár
@ 2015-05-28 16:01           ` Tony Lindgren
  2015-05-28 20:26             ` Matthijs van Duin
  0 siblings, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2015-05-28 16:01 UTC (permalink / raw)
  To: linux-arm-kernel

* Pali Roh?r <pali.rohar@gmail.com> [150528 00:39]:
> On Wednesday 11 February 2015 14:40:33 Nishanth Menon wrote:
> > On Wed, Feb 11, 2015 at 2:28 PM, Pali Roh?r <pali.rohar@gmail.com> wrote:
> > > On Wednesday 11 February 2015 16:22:51 Matthijs van Duin wrote:
> > >> On 11 February 2015 at 13:39, Pali Roh?r <pali.rohar@gmail.com>
> > > wrote:
> > >> >> Anyhow, since checking the firewalls/APs to see if you have
> > >> >> permission will probably only get you yet another fault if
> > >> >> things are walled off, the robust way of dealing with this
> > >> >> sort of situation is by probing the device with a read
> > >> >> while trapping bus faults. This also handles modules that
> > >> >> are unreachable for other reasons, e.g. being disabled by
> > >> >> eFuse.
> > >> >
> > >> > It is possible to patch kernel code to mask or ignore that
> > >> > fault? Can you help me with something like that?
> > >>
> > >> As I mentioned, I'm still learning my way around the kernel,
> > >> so I don't feel very comfortable suggesting a concrete patch
> > >> just yet. I've been browsing arch/arm/mm/ however and my
> > >> impression is that all that would be required is editing
> > >> fault.c by making a copy of do_bad but containing
> > >>     return user_mode(regs) || !fixup_exception(regs);
> > >> and hook it onto the appropriate fault codes.  However, this
> > >> really needs the opinion of someone more familiar with this
> > >> code.
> > >>
> > >> I do have an observation to make on the issue of fault
> > >> decoding: the list in fsr-2level.c may be "standard ARMv3 and
> > >> ARMv4 aborts" but they are quite wrong for ARMv7 which has:
> > >>
> > >> [ 0] -
> > >> [ 1] alignment fault
> > >> [ 2] debug event
> > >> [ 3] section access flag fault
> > >> [ 4] instruction cache maintainance fault (reported via data
> > >> abort) [ 5] section translation fault
> > >> [ 6] page access flag fault
> > >> [ 7] page translation fault
> > >> [ 8] bus error on access
> > >> [ 9] section domain fault
> > >> [10] -
> > >> [11] page domain fault
> > >> [12] bus error on section table walk
> > >> [13] section permission fault
> > >> [14] bus error on page table walk
> > >> [15] page permission fault
> > >> [16] (TLB conflict abort)
> > >> [17] -
> > >> [18] -
> > >> [19] -
> > >> [20] (lockdown abort)
> > >> [21] -
> > >> [22] async bus error (reported via data abort)
> > >> [23] -
> > >> [24] async parity/ECC error (reported via data abort)
> > >> [25] parity/ECC error on access
> > >> [26] (coprocessor abort)
> > >> [27] -
> > >> [28] parity/ECC error on section table walk
> > >> [29] -
> > >> [30] parity/ECC error on page table walk
> > >> [31] -
> > >>
> > >> Some entries are patched up near the bottom of fault.c but
> > >> many bogus messages remain, for example the "on linefetch" vs
> > >> "on non-linefetch" is misleading since no such thing can be
> > >> inferred from the fault status on v7.  Also, the i-cache
> > >> maintenance fault handling looks wrong to me: it should fetch
> > >> the actual fault status from IFSR (even though the address
> > >> still comes from DFSR) and dispatch based on that.
> > >>
> > >> Async external aborts (async bus error and async parity/ECC
> > >> error) give you basically no info. DFAR will contain garbage
> > >> hence displaying it will confuse rather than enlighten, a
> > >> traceback is pointless since the instruction that caused the
> > >> access is long retired, likewise user_mode() doesn't matter
> > >> since a transition to kernel space may have happened after
> > >> the access that cause the abort. Basically they should be
> > >> treated more as an IRQ than as a fault (note they can also be
> > >> masked just like irqs). In case of a bus error, it may be
> > >> appropriate to just warn about it, or perhaps send a signal
> > >> to the current process, although in the latter case it should
> > >> have some means to distinguish it from a synchronous bus
> > >> error.
> > >>
> > >> At least on the cortex-a8, a parity/ECC error (whether async
> > >> or not) is to be regarded as absolutely fatal.  Quoth the
> > >> TRM: "No recovery is possible. The abort handler must disable
> > >> the caches, communicate the fail directly with the external
> > >> system, request a reboot."
> > >>
> > >> Bit 10 no longer indicates an asynchronous (let alone
> > >> imprecise) fault.  Apart from the debug events and async
> > >> aborts (and possibly some implementation-defined aborts), all
> > >> aborts listed are synchronous, and DFAR/IFAR is valid.
> > >> There's no technical obstruction to make these trappable via
> > >> the kernel exception handling mechanism. (Though at least in
> > >> case of parity/ECC errors one shouldn't.)
> > >
> > > Tony, Nishanth, or somebody else... can you help with memory
> > > management? Or do you know some expert for arch/arm/mm/ code?
> > 
> > Folks in linux-arm-kernel are probably the right people, I suppose.
> > Looping them in.
> > 
> 
> So pinging linux-arm-kernel again. Any idea how to handle that fault?

Here's what might work.. You could patch drivers/bus/omap_l3*.c
code to probe the devices after the omap_l3 driver interrupts
are enabled.

For failed device access you get an interrupt so you know to not
create the struct device entry for that device. For the working
devices you can do the struct device entry and let it probe.

So basically we could make the omap_l3* drivers managers for
the omap bus code instead of probing them with "simple-bus"
and omap_device_build_from_dt().

No need to have these device probe early, and they are all
internal devices so as long as we know the type and address
for each soc the omap_l3 drive code could probe them.

It seems that trying to do this early just makes things more
complicated and should be done in the bootloader instead of
kernel if needed early.

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28 16:01           ` Tony Lindgren
@ 2015-05-28 20:26             ` Matthijs van Duin
  2015-05-28 22:24               ` Tony Lindgren
  0 siblings, 1 reply; 20+ messages in thread
From: Matthijs van Duin @ 2015-05-28 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 28 May 2015 at 18:01, Tony Lindgren <tony@atomide.com> wrote:
> For failed device access you get an interrupt

Well for failed reads you get a bus error, and "catching" those (e.g.
using the existing exception mechanism used to catch MMU faults) is
the whole issue.

Though now that you mention it, it is true that for writes you won't
get any fault (at least on the DM814x and AM335x the posting point
appears to be the async bridge from MPUSS to the L3 interconnect) but
an interconnect error irq instead. It may be easier to make some kind
of harmless write (e.g. to the version register), wait a bit, and
check if the write triggered an interconnect error.

Feels hackish though: you'd need to be sure you waited long enough
(though using a read from another device on the same L4 interconnect
should be a reliable barrier in this case), and drivers for
receiving/interpreting interconnect errors are not implemented yet on
all SoCs (for some, like the AM335x, TI didn't even bother publishing
the relevant data in its TRM). Interconnect errors can also be lost in
some cases (multiple errors involving the same target in a short time
window) though that problem shouldn't arise in this particular case.

Also, presumably interconnect error reporting is unavailable on HS
devices given the fact that all interconnect registers seemed to be
inaccessible?

Matthijs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28 20:26             ` Matthijs van Duin
@ 2015-05-28 22:24               ` Tony Lindgren
  2015-05-28 22:27                 ` Pali Rohár
  2015-05-29  0:58                 ` Matthijs van Duin
  0 siblings, 2 replies; 20+ messages in thread
From: Tony Lindgren @ 2015-05-28 22:24 UTC (permalink / raw)
  To: linux-arm-kernel

* Matthijs van Duin <matthijsvanduin@gmail.com> [150528 13:28]:
> On 28 May 2015 at 18:01, Tony Lindgren <tony@atomide.com> wrote:
> > For failed device access you get an interrupt
> 
> Well for failed reads you get a bus error, and "catching" those (e.g.
> using the existing exception mechanism used to catch MMU faults) is
> the whole issue.
> 
> Though now that you mention it, it is true that for writes you won't
> get any fault (at least on the DM814x and AM335x the posting point
> appears to be the async bridge from MPUSS to the L3 interconnect) but
> an interconnect error irq instead. It may be easier to make some kind
> of harmless write (e.g. to the version register), wait a bit, and
> check if the write triggered an interconnect error.
> 
> Feels hackish though: you'd need to be sure you waited long enough
> (though using a read from another device on the same L4 interconnect
> should be a reliable barrier in this case), and drivers for
> receiving/interpreting interconnect errors are not implemented yet on
> all SoCs (for some, like the AM335x, TI didn't even bother publishing
> the relevant data in its TRM). Interconnect errors can also be lost in
> some cases (multiple errors involving the same target in a short time
> window) though that problem shouldn't arise in this particular case.

Hmm I believe the interrupt happens immediately trying to access an
invalid device. But maybe I'm thinking about just errors if a device
is not powered or clocked. So obviously some experiments need to be
done :)

The advantage here would be that the l3 driver actually already knows
quite a bit about the devices on the bus.
 
> Also, presumably interconnect error reporting is unavailable on HS
> devices given the fact that all interconnect registers seemed to be
> inaccessible?

Oh OK yeah then that would not work for Pali's case. I guess it just
needs to be tested.

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28 22:24               ` Tony Lindgren
@ 2015-05-28 22:27                 ` Pali Rohár
  2015-05-29  0:15                   ` Tony Lindgren
  2015-05-29  0:58                 ` Matthijs van Duin
  1 sibling, 1 reply; 20+ messages in thread
From: Pali Rohár @ 2015-05-28 22:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 29 May 2015 00:24:13 Tony Lindgren wrote:
> * Matthijs van Duin <matthijsvanduin@gmail.com> [150528 13:28]:
> > On 28 May 2015 at 18:01, Tony Lindgren <tony@atomide.com> wrote:
> > > For failed device access you get an interrupt
> > 
> > Well for failed reads you get a bus error, and "catching" those
> > (e.g. using the existing exception mechanism used to catch MMU
> > faults) is the whole issue.
> > 
> > Though now that you mention it, it is true that for writes you
> > won't get any fault (at least on the DM814x and AM335x the posting
> > point appears to be the async bridge from MPUSS to the L3
> > interconnect) but an interconnect error irq instead. It may be
> > easier to make some kind of harmless write (e.g. to the version
> > register), wait a bit, and check if the write triggered an
> > interconnect error.
> > 
> > Feels hackish though: you'd need to be sure you waited long enough
> > (though using a read from another device on the same L4
> > interconnect should be a reliable barrier in this case), and
> > drivers for receiving/interpreting interconnect errors are not
> > implemented yet on all SoCs (for some, like the AM335x, TI didn't
> > even bother publishing the relevant data in its TRM). Interconnect
> > errors can also be lost in some cases (multiple errors involving
> > the same target in a short time window) though that problem
> > shouldn't arise in this particular case.
> 
> Hmm I believe the interrupt happens immediately trying to access an
> invalid device. But maybe I'm thinking about just errors if a device
> is not powered or clocked. So obviously some experiments need to be
> done :)
> 
> The advantage here would be that the l3 driver actually already knows
> quite a bit about the devices on the bus.
> 
> > Also, presumably interconnect error reporting is unavailable on HS
> > devices given the fact that all interconnect registers seemed to be
> > inaccessible?
> 
> Oh OK yeah then that would not work for Pali's case. I guess it just
> needs to be tested.
> 
> Regards,
> 
> Tony

Ok, thanks for info. Do you have some quick small patches for testing? 
Or some pointers what is needed to modify and how?

-- 
Pali Roh?r
pali.rohar at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20150529/f1493a36/attachment-0001.sig>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28 22:27                 ` Pali Rohár
@ 2015-05-29  0:15                   ` Tony Lindgren
  0 siblings, 0 replies; 20+ messages in thread
From: Tony Lindgren @ 2015-05-29  0:15 UTC (permalink / raw)
  To: linux-arm-kernel

* Pali Roh?r <pali.rohar@gmail.com> [150528 15:29]:
> On Friday 29 May 2015 00:24:13 Tony Lindgren wrote:
> > * Matthijs van Duin <matthijsvanduin@gmail.com> [150528 13:28]:
> > > On 28 May 2015 at 18:01, Tony Lindgren <tony@atomide.com> wrote:
> > > > For failed device access you get an interrupt
> > > 
> > > Well for failed reads you get a bus error, and "catching" those
> > > (e.g. using the existing exception mechanism used to catch MMU
> > > faults) is the whole issue.
> > > 
> > > Though now that you mention it, it is true that for writes you
> > > won't get any fault (at least on the DM814x and AM335x the posting
> > > point appears to be the async bridge from MPUSS to the L3
> > > interconnect) but an interconnect error irq instead. It may be
> > > easier to make some kind of harmless write (e.g. to the version
> > > register), wait a bit, and check if the write triggered an
> > > interconnect error.
> > > 
> > > Feels hackish though: you'd need to be sure you waited long enough
> > > (though using a read from another device on the same L4
> > > interconnect should be a reliable barrier in this case), and
> > > drivers for receiving/interpreting interconnect errors are not
> > > implemented yet on all SoCs (for some, like the AM335x, TI didn't
> > > even bother publishing the relevant data in its TRM). Interconnect
> > > errors can also be lost in some cases (multiple errors involving
> > > the same target in a short time window) though that problem
> > > shouldn't arise in this particular case.
> > 
> > Hmm I believe the interrupt happens immediately trying to access an
> > invalid device. But maybe I'm thinking about just errors if a device
> > is not powered or clocked. So obviously some experiments need to be
> > done :)
> > 
> > The advantage here would be that the l3 driver actually already knows
> > quite a bit about the devices on the bus.
> > 
> > > Also, presumably interconnect error reporting is unavailable on HS
> > > devices given the fact that all interconnect registers seemed to be
> > > inaccessible?
> > 
> > Oh OK yeah then that would not work for Pali's case. I guess it just
> > needs to be tested.
> > 
> > Regards,
> > 
> > Tony
> 
> Ok, thanks for info. Do you have some quick small patches for testing? 
> Or some pointers what is needed to modify and how?

Well I guess the initial test would be to make sure you have
CONFIG_OMAP_INTERCONNECT=y, comment out status = "disabled" in
omap3-n900.dts for aes, patch in the aes hwmod data, check that
you have CONFIG_CRYPTO_DEV_OMAP_AES=y, boot the kernel.

Do you get just the l3_smx interrupt instead of the "Unhandled fault"?

If so then we can use the interrupt handle to make the probe fail.
Not sure yet what would be the best way to do that though :)

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-28 22:24               ` Tony Lindgren
  2015-05-28 22:27                 ` Pali Rohár
@ 2015-05-29  0:58                 ` Matthijs van Duin
  2015-05-29  1:35                   ` Matthijs van Duin
  1 sibling, 1 reply; 20+ messages in thread
From: Matthijs van Duin @ 2015-05-29  0:58 UTC (permalink / raw)
  To: linux-arm-kernel

On 29 May 2015 at 00:24, Tony Lindgren <tony@atomide.com> wrote:
> Hmm I believe the interrupt happens immediately trying to access an
> invalid device. But maybe I'm thinking about just errors if a device
> is not powered or clocked.

It is only guaranteed to happen immediately (before the next
instruction is executed) if the error occurs before the posting-point
of the write. However, in that case the error is reported in-band to
the cpu, resulting in a (synchronous) bus error which takes precedence
over the out-of-band error irq (if any is signalled). Once the write
is posted however, the cpu will receive an ack on the write and
continue execution, and there's no reason to assume that an error irq
will happen *immediately* after the write.

Of course it typically will happen soon afterwards, possibly even
before the next instruction is executed, depending a bit on how soon
after the posting-point the error occurs versus how long it takes for
the write-ack to reach the cpu. On the other hand, it's also possible
the write, after becoming posted, gets stuck for a while due to a
burst of higher-priority traffic. (I also recall reading about some
situation where a request needs to wait for something to be
dynamically powered up before an error response could be generated,
but I think that was on the OMAP 4.)

So that's the icky part: it will very likely happen almost
immediately. There's however no *guarantee* that it will, and in fact
it's quite tricky to absolutely make sure a write is no longer in
transit. The usual solution is an "OCP barrier": a read that is known
to follow the same path as the write. Normally that means a read from
the same peripheral, but that would defeat the purpose in this case.
Fortunately, the L4 interconnects (unlike the L3) detect firewall
violations in the initiator agent rather than the target agents, hence
a read from any peripheral on the same L4 interconnect suffices.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-29  0:58                 ` Matthijs van Duin
@ 2015-05-29  1:35                   ` Matthijs van Duin
  2015-05-29 15:50                     ` Tony Lindgren
  0 siblings, 1 reply; 20+ messages in thread
From: Matthijs van Duin @ 2015-05-29  1:35 UTC (permalink / raw)
  To: linux-arm-kernel

On 29 May 2015 at 02:58, Matthijs van Duin <matthijsvanduin@gmail.com> wrote:
> It is only guaranteed to happen immediately (before the next
> instruction is executed) if the error occurs before the posting-point
> of the write. However, in that case the error is reported in-band to
> the cpu, resulting in a (synchronous) bus error which takes precedence
> over the out-of-band error irq (if any is signalled).

OK, all this was actually assuming linux uses device-type mappings for
device mappings, which was also the impression I got from
build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
maze). A quick test however seems to imply otherwise:

~# ./bogus-dev-write
Bus error

So... linux actually uses strongly-ordered mappings? I really didn't
expect that, given the performance implications (especially on a
strictly in-order cpu like the Cortex-A8 which will really just sit
there picking its nose until the write completes) and I think I recall
having seen an OCP barrier being used somewhere in driver code...

Well, in that case everything I said is technically still true, except
the posting point is the peripheral itself. That also means the
interconnect error reporting mechanism is not really useful for
probing since you'll get a bus error before any error irq is
delivered.

So I'd say you're back at having to trap that bus error using the
exception handling mechanism, which I still suspect shouldn't be hard
to do.

Or perhaps you could probe the device using a DMA access and combine
that with the interconnect error reporting irq... ;-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-29  1:35                   ` Matthijs van Duin
@ 2015-05-29 15:50                     ` Tony Lindgren
  2015-05-29 18:16                       ` Tony Lindgren
  2015-05-30 15:22                       ` Matthijs van Duin
  0 siblings, 2 replies; 20+ messages in thread
From: Tony Lindgren @ 2015-05-29 15:50 UTC (permalink / raw)
  To: linux-arm-kernel

* Matthijs van Duin <matthijsvanduin@gmail.com> [150528 18:37]:
> On 29 May 2015 at 02:58, Matthijs van Duin <matthijsvanduin@gmail.com> wrote:
> > It is only guaranteed to happen immediately (before the next
> > instruction is executed) if the error occurs before the posting-point
> > of the write. However, in that case the error is reported in-band to
> > the cpu, resulting in a (synchronous) bus error which takes precedence
> > over the out-of-band error irq (if any is signalled).
> 
> OK, all this was actually assuming linux uses device-type mappings for
> device mappings, which was also the impression I got from
> build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
> maze). A quick test however seems to imply otherwise:
> 
> ~# ./bogus-dev-write
> Bus error
> 
> So... linux actually uses strongly-ordered mappings? I really didn't
> expect that, given the performance implications (especially on a
> strictly in-order cpu like the Cortex-A8 which will really just sit
> there picking its nose until the write completes) and I think I recall
> having seen an OCP barrier being used somewhere in driver code...

I believe some TI kernels use strongly-ordered mappings, mainline
kernel does not. Which kernel version are you using?
 
> Well, in that case everything I said is technically still true, except
> the posting point is the peripheral itself. That also means the
> interconnect error reporting mechanism is not really useful for
> probing since you'll get a bus error before any error irq is
> delivered.

Hmm if that's the case then yes we can't use the error irq. However,
what I've seen so far is that we only get the bus error if the
l3_* drivers are configured. I guess some more testing is needed.
 
> So I'd say you're back at having to trap that bus error using the
> exception handling mechanism, which I still suspect shouldn't be hard
> to do.

And in that case it makes sense to do that in the bootloader to
avoid adding any custom early boot code to Linux kernel.
 
> Or perhaps you could probe the device using a DMA access and combine
> that with the interconnect error reporting irq... ;-)

Heh too many dependencies :)

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-29 15:50                     ` Tony Lindgren
@ 2015-05-29 18:16                       ` Tony Lindgren
  2015-05-30 15:22                       ` Matthijs van Duin
  1 sibling, 0 replies; 20+ messages in thread
From: Tony Lindgren @ 2015-05-29 18:16 UTC (permalink / raw)
  To: linux-arm-kernel

* Tony Lindgren <tony@atomide.com> [150529 08:52]:
> * Matthijs van Duin <matthijsvanduin@gmail.com> [150528 18:37]:
> > On 29 May 2015 at 02:58, Matthijs van Duin <matthijsvanduin@gmail.com> wrote:
> > > It is only guaranteed to happen immediately (before the next
> > > instruction is executed) if the error occurs before the posting-point
> > > of the write. However, in that case the error is reported in-band to
> > > the cpu, resulting in a (synchronous) bus error which takes precedence
> > > over the out-of-band error irq (if any is signalled).
> > 
> > OK, all this was actually assuming linux uses device-type mappings for
> > device mappings, which was also the impression I got from
> > build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
> > maze). A quick test however seems to imply otherwise:
> > 
> > ~# ./bogus-dev-write
> > Bus error
> > 
> > So... linux actually uses strongly-ordered mappings? I really didn't
> > expect that, given the performance implications (especially on a
> > strictly in-order cpu like the Cortex-A8 which will really just sit
> > there picking its nose until the write completes) and I think I recall
> > having seen an OCP barrier being used somewhere in driver code...
> 
> I believe some TI kernels use strongly-ordered mappings, mainline
> kernel does not. Which kernel version are you using?
>  
> > Well, in that case everything I said is technically still true, except
> > the posting point is the peripheral itself. That also means the
> > interconnect error reporting mechanism is not really useful for
> > probing since you'll get a bus error before any error irq is
> > delivered.
> 
> Hmm if that's the case then yes we can't use the error irq. However,
> what I've seen so far is that we only get the bus error if the
> l3_* drivers are configured. I guess some more testing is needed.
>  
> > So I'd say you're back at having to trap that bus error using the
> > exception handling mechanism, which I still suspect shouldn't be hard
> > to do.
> 
> And in that case it makes sense to do that in the bootloader to
> avoid adding any custom early boot code to Linux kernel.
>  
> > Or perhaps you could probe the device using a DMA access and combine
> > that with the interconnect error reporting irq... ;-)
> 
> Heh too many dependencies :)

If we can't use the l3 interrrupts, then something similar to commit
fdf4850cb5b2 ("ARM: BCM5301X: workaround suppress fault") might be
doable too.

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-29 15:50                     ` Tony Lindgren
  2015-05-29 18:16                       ` Tony Lindgren
@ 2015-05-30 15:22                       ` Matthijs van Duin
  2015-06-01 17:58                         ` Tony Lindgren
  1 sibling, 1 reply; 20+ messages in thread
From: Matthijs van Duin @ 2015-05-30 15:22 UTC (permalink / raw)
  To: linux-arm-kernel

On 29 May 2015 at 17:50, Tony Lindgren <tony@atomide.com> wrote:
> I believe some TI kernels use strongly-ordered mappings, mainline
> kernel does not. Which kernel version are you using?

Normally I periodically rebuild based on Robert C Nelson's -bone
kernel (but with heavily customized config). I also tried a plain
4.1.0-rc5-bone3, the generic 4.1.0-rc5-armv7-x0 (the most
vanilla-looking kernel I could find in my debian package list), and
for the heck of it also the classic 3.14.43-ti-r66.

In all cases I observed a synchronous bus error (dubiously reported as
"external abort on non-linefetch (0x1818)") on an AM335x with this
trivial test:

int main() {
        int fd = open( "/dev/mem", O_RDWR | O_DSYNC );
        if( fd < 0 ) return 1;
        void *ptr = mmap( NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0x42000000 );
        if( ptr == MAP_FAILED ) return 1;
        *(volatile int *)ptr = 0;
        return 0;
}

I even considered for a moment that maybe the AM335x has some "all
writes non-posted" thing enabled (which I think is available as a
switch on OMAP 4/5?). It seemed unlikely, but since most of my
exploration of interconnect behaviour was done on a DM814x, I
double-checked by performing the same write in a baremetal test
program (with that region configured Device-type in the MMU). As
expected, no data abort occurred, so writes most certainly are posted.

So I have trouble coming up with any explanation for this other than
the use of strongly-ordered mappings.

(Curiously BTW, omitting O_DSYNC made no difference.)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-05-30 15:22                       ` Matthijs van Duin
@ 2015-06-01 17:58                         ` Tony Lindgren
  2015-06-01 20:32                           ` Matthijs van Duin
  0 siblings, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2015-06-01 17:58 UTC (permalink / raw)
  To: linux-arm-kernel

* Matthijs van Duin <matthijsvanduin@gmail.com> [150530 08:24]:
> On 29 May 2015 at 17:50, Tony Lindgren <tony@atomide.com> wrote:
> > I believe some TI kernels use strongly-ordered mappings, mainline
> > kernel does not. Which kernel version are you using?
> 
> Normally I periodically rebuild based on Robert C Nelson's -bone
> kernel (but with heavily customized config). I also tried a plain
> 4.1.0-rc5-bone3, the generic 4.1.0-rc5-armv7-x0 (the most
> vanilla-looking kernel I could find in my debian package list), and
> for the heck of it also the classic 3.14.43-ti-r66.
> 
> In all cases I observed a synchronous bus error (dubiously reported as
> "external abort on non-linefetch (0x1818)") on an AM335x with this
> trivial test:
> 
> int main() {
>         int fd = open( "/dev/mem", O_RDWR | O_DSYNC );
>         if( fd < 0 ) return 1;
>         void *ptr = mmap( NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0x42000000 );
>         if( ptr == MAP_FAILED ) return 1;
>         *(volatile int *)ptr = 0;
>         return 0;
> }
> 
> I even considered for a moment that maybe the AM335x has some "all
> writes non-posted" thing enabled (which I think is available as a
> switch on OMAP 4/5?). It seemed unlikely, but since most of my
> exploration of interconnect behaviour was done on a DM814x, I
> double-checked by performing the same write in a baremetal test
> program (with that region configured Device-type in the MMU). As
> expected, no data abort occurred, so writes most certainly are posted.
> 
> So I have trouble coming up with any explanation for this other than
> the use of strongly-ordered mappings.
> 
> (Curiously BTW, omitting O_DSYNC made no difference.)

I think these kernels are missing the configuration for l3-noc
driver?

I tried it on omap4 that has l3-noc configured, and it first produces
"Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
and the L3 interrupt only after that. So yeah, you're right, we can't
use the interrupts here. I somehow remembered we'd get only the L3
interrupt if configured.

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-06-01 17:58                         ` Tony Lindgren
@ 2015-06-01 20:32                           ` Matthijs van Duin
  2015-06-01 20:52                             ` Tony Lindgren
  0 siblings, 1 reply; 20+ messages in thread
From: Matthijs van Duin @ 2015-06-01 20:32 UTC (permalink / raw)
  To: linux-arm-kernel

On 1 June 2015 at 19:58, Tony Lindgren <tony@atomide.com> wrote:
> I think these kernels are missing the configuration for l3-noc
> driver?

Yup. Since I'm pretty sure I have all the necessary info I was hoping
look into that... somewhere in my copious spare time...

> I tried it on omap4 that has l3-noc configured, and it first produces
> "Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",

(Though making a patch to fix that annoyingly wrong and useless
message is higher on my list of priorities)

> and the L3 interrupt only after that. So yeah, you're right, we can't
> use the interrupts here. I somehow remembered we'd get only the L3
> interrupt if configured.

The bus error is not influenced by L3 error reporting config afaik,
and it will always win from the irq: even though the irq is almost
certainly asserted first, it can't be taken until the load/store
instruction completes, and then the fault will take precedence.

While implementing L3 error reporting in my forth system I ran into a
tricky scenario though: it turns out that if an irq occurs while the
cpu is waiting for instruction fetch, it does allow the irq to be
taken. The interrupted fetch is abandoned and any bus error it may
have produced is ignored since exception entry/exit is an implicit
instruction sync barrier. On return it is simply refetched...

Hence, the result from attempting to execute code from an invalid address:
  fetching from [invalid]
    irq entry (LR=[invalid])
    L3 error displayed
    irq exit
  fetching from [invalid]
    irq entry (LR=[invalid])
    L3 error displayed
    irq exit
  fetching from [invalid]
    ...
(repeat until watchdog expires)

Anyhow, so we still have the puzzling fact that apparently neither of
us was expecting device memory to use a strongly-ordered mapping,
getting a bus error on a write (outside MPUSS itself) shows that it
does.

I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
feeling hopelessly lost there... (the multitude of ARM architecture
versions/flavors supported aren't helping.)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-06-01 20:32                           ` Matthijs van Duin
@ 2015-06-01 20:52                             ` Tony Lindgren
  2015-06-02  4:21                               ` Matthijs van Duin
  0 siblings, 1 reply; 20+ messages in thread
From: Tony Lindgren @ 2015-06-01 20:52 UTC (permalink / raw)
  To: linux-arm-kernel

* Matthijs van Duin <matthijsvanduin@gmail.com> [150601 13:34]:
> On 1 June 2015 at 19:58, Tony Lindgren <tony@atomide.com> wrote:
> > I think these kernels are missing the configuration for l3-noc
> > driver?
> 
> Yup. Since I'm pretty sure I have all the necessary info I was hoping
> look into that... somewhere in my copious spare time...
> 
> > I tried it on omap4 that has l3-noc configured, and it first produces
> > "Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
> 
> (Though making a patch to fix that annoyingly wrong and useless
> message is higher on my list of priorities)
> 
> > and the L3 interrupt only after that. So yeah, you're right, we can't
> > use the interrupts here. I somehow remembered we'd get only the L3
> > interrupt if configured.
> 
> The bus error is not influenced by L3 error reporting config afaik,
> and it will always win from the irq: even though the irq is almost
> certainly asserted first, it can't be taken until the load/store
> instruction completes, and then the fault will take precedence.
> 
> While implementing L3 error reporting in my forth system I ran into a
> tricky scenario though: it turns out that if an irq occurs while the
> cpu is waiting for instruction fetch, it does allow the irq to be
> taken. The interrupted fetch is abandoned and any bus error it may
> have produced is ignored since exception entry/exit is an implicit
> instruction sync barrier. On return it is simply refetched...
> 
> Hence, the result from attempting to execute code from an invalid address:
>   fetching from [invalid]
>     irq entry (LR=[invalid])
>     L3 error displayed
>     irq exit
>   fetching from [invalid]
>     irq entry (LR=[invalid])
>     L3 error displayed
>     irq exit
>   fetching from [invalid]
>     ...
> (repeat until watchdog expires)

OK that must be the case I've seen then. Probably that happens
when a device is not clocked. 
 
> Anyhow, so we still have the puzzling fact that apparently neither of
> us was expecting device memory to use a strongly-ordered mapping,
> getting a bus error on a write (outside MPUSS itself) shows that it
> does.

Hmm well it should be just MT_DEVICE for anything Linux ioremaps..
Care to verify that from a device driver that does ioremap on it
first?
 
> I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
> feeling hopelessly lost there... (the multitude of ARM architecture
> versions/flavors supported aren't helping.)

Heh yeah too much hardware churn going on :)

Regards,

Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
  2015-06-01 20:52                             ` Tony Lindgren
@ 2015-06-02  4:21                               ` Matthijs van Duin
  0 siblings, 0 replies; 20+ messages in thread
From: Matthijs van Duin @ 2015-06-02  4:21 UTC (permalink / raw)
  To: linux-arm-kernel

On 1 June 2015 at 22:52, Tony Lindgren <tony@atomide.com> wrote:
> OK that must be the case I've seen then. Probably that happens
> when a device is not clocked.

It happens for any interconnect error reported as a result of
instruction fetch, but that is itself not a very common occurrence and
obviously doesn't apply to device memory.

Another case where the L3 error irq may be taken first is if the bus
error is asynchronous, but I don't think this combo can be produced on
a dm81xx or am335x, but that's mainly due to the strictly in-order
Cortex-A8 making almost every abort synchronous. I'd expect async
aborts are more common on an A9.

> Hmm well it should be just MT_DEVICE for anything Linux ioremaps..

Yikes, so both /dev/mem and uio are behaving unlike any device driver:
both use remap_pfn_range() after running the vm_page_prot though
pgprot_noncached() to set the memory type to L_PTE_MT_UNCACHED, which
counterintuitively turns out to mean strongly-ordered. o.O  Especially
uio is acting inappropriate here imho.

But this is problematic... these ranges are already mapped by the
kernel, and ARM explicitly forbids mapping the same physical range
twice with different memory attributes (and it's not the only
architecture to do so). Hmmz...

Anyhow, drifting a bit off-topic here. I'm going to some more reading
and thinking about this.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-06-02  4:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20131206213613.GA19648@earth.universe>
     [not found] ` <201502111339.54480@pali>
     [not found]   ` <CAALWOA_ngoSKjB=ZQ264Va37bBK7v41Ei45SyoYLiMdanTKnxQ@mail.gmail.com>
     [not found]     ` <201502112128.44852@pali>
2015-02-11 20:40       ` runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot) Nishanth Menon
2015-02-18 21:14         ` Pali Rohár
2015-05-28  7:37         ` Pali Rohár
2015-05-28 16:01           ` Tony Lindgren
2015-05-28 20:26             ` Matthijs van Duin
2015-05-28 22:24               ` Tony Lindgren
2015-05-28 22:27                 ` Pali Rohár
2015-05-29  0:15                   ` Tony Lindgren
2015-05-29  0:58                 ` Matthijs van Duin
2015-05-29  1:35                   ` Matthijs van Duin
2015-05-29 15:50                     ` Tony Lindgren
2015-05-29 18:16                       ` Tony Lindgren
2015-05-30 15:22                       ` Matthijs van Duin
2015-06-01 17:58                         ` Tony Lindgren
2015-06-01 20:32                           ` Matthijs van Duin
2015-06-01 20:52                             ` Tony Lindgren
2015-06-02  4:21                               ` Matthijs van Duin
2015-02-19 18:20     ` Pali Rohár
2015-02-19 20:25       ` Matthijs van Duin
2015-02-19 21:10       ` Aaro Koskinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox