linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2014-02-07 18:20 imprecise abort behaviour Ben Dooks
@ 2014-02-07 18:20 ` Ben Dooks
  2014-02-10 14:37   ` Dave Martin
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Dooks @ 2014-02-07 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

Given that imprecise aborts may be delivered after the action that
caused them (or even for non-cpu related activities such as bridge
faults from a bus-master) it is possible that the wrong process is
terminated as a result.

It is not know at this time in an SMP system which cores get notified
of an imprecise external abort, we have yet to find the right details
in the architecture reference manuals. This also means that killing
the process is probably the wrong thing to do on reception of these aborts.

Add a handler to take and print imprecise aborts and allow the process
to continue. This should ensure that the abort is shown but not kill
the process that was running on the cpu core at the time.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 arch/arm/mm/fault.c      | 15 +++++++++++++++
 arch/arm/mm/fsr-2level.c | 32 ++++++++++++++++----------------
 2 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index eb8830a..c1fd5ba 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -512,6 +512,21 @@ do_bad(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * On an imprecise external abort it is possible that the currently running
+ * process did not cause it (it could be from an external bus bridge or
+ * another device causing a fault on the bus).
+ *
+ * Always return handled as we do not know how to do it and killing the
+ * current process may not prevent future faults.
+ */
+static int
+do_bad_imprecise(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
+{
+	printk(KERN_INFO "Caught imprecise abort (0x%03x) %08lx", fsr, addr);
+	return 0;
+}
+
 struct fsr_info {
 	int	(*fn)(unsigned long addr, unsigned int fsr, struct pt_regs *regs);
 	int	sig;
diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c..8f5ef60 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -24,22 +24,22 @@ static struct fsr_info fsr_info[] = {
 	 * 10 of the FSR, and may not be recoverable.  These are only
 	 * supported if the CPU abort handler supports bit 10.
 	 */
-	{ do_bad,		SIGBUS,  0,		"unknown 16"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 17"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 18"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 19"			   },
-	{ do_bad,		SIGBUS,  0,		"lock abort"			   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 21"			   },
-	{ do_bad,		SIGBUS,  BUS_OBJERR,	"imprecise external abort"	   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 23"			   },
-	{ do_bad,		SIGBUS,  0,		"dcache parity error"		   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 25"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 26"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 27"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 28"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 29"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 30"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 31"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 16"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 17"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 18"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 19"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"lock abort"			   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 21"			   },
+	{ do_bad_imprecise,	SIGBUS,  BUS_OBJERR,	"imprecise external abort"	   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 23"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"dcache parity error"		   }, /* xscale */
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 25"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 26"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 27"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 28"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 29"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 30"			   },
+	{ do_bad_imprecise,	SIGBUS,  0,		"unknown 31"			   },
 };
 
 static struct fsr_info ifsr_info[] = {
-- 
1.8.5.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2014-02-07 18:20 ` [PATCH] ARM: mm: add imprecise abort non-deadly handler Ben Dooks
@ 2014-02-10 14:37   ` Dave Martin
  2014-02-10 17:25     ` Ben Dooks
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Martin @ 2014-02-10 14:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 07, 2014 at 06:20:14PM +0000, Ben Dooks wrote:
> Given that imprecise aborts may be delivered after the action that
> caused them (or even for non-cpu related activities such as bridge
> faults from a bus-master) it is possible that the wrong process is
> terminated as a result.
> 
> It is not know at this time in an SMP system which cores get notified
> of an imprecise external abort, we have yet to find the right details
> in the architecture reference manuals. This also means that killing
> the process is probably the wrong thing to do on reception of these aborts.
> 
> Add a handler to take and print imprecise aborts and allow the process
> to continue. This should ensure that the abort is shown but not kill
> the process that was running on the cpu core at the time.

Not treating these as thread-specific faults seems correct, since we
never have a way to map these aborts back to the culprit ... except that
there is a likelihood the culprit is still running when the abort fires.


"Spurious" imprecise aborts pretty much always indicate a hardware error
or a nasty bug somewhere.

Another cause is badly implemented, buggy or malicious userspace software
being given more exotic mmaps that it is qualified to deal with
responsibly.  That's a nasty bug in the distro maintainer / system
administrator / vendor.

So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
We must not encourage people to think that these aborts are somehow
benign.

If we really want people to fix their bugs, it may be worth considering
panic(), or doing this when some threshold is reached.  This may be a
bit harsh though, at least without some threshold.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2014-02-10 14:37   ` Dave Martin
@ 2014-02-10 17:25     ` Ben Dooks
  2014-02-11 15:43       ` Dave Martin
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Dooks @ 2014-02-10 17:25 UTC (permalink / raw)
  To: linux-arm-kernel

On 10/02/14 14:37, Dave Martin wrote:
> On Fri, Feb 07, 2014 at 06:20:14PM +0000, Ben Dooks wrote:
>> Given that imprecise aborts may be delivered after the action that
>> caused them (or even for non-cpu related activities such as bridge
>> faults from a bus-master) it is possible that the wrong process is
>> terminated as a result.
>>
>> It is not know at this time in an SMP system which cores get notified
>> of an imprecise external abort, we have yet to find the right details
>> in the architecture reference manuals. This also means that killing
>> the process is probably the wrong thing to do on reception of these aborts.
>>
>> Add a handler to take and print imprecise aborts and allow the process
>> to continue. This should ensure that the abort is shown but not kill
>> the process that was running on the cpu core at the time.
>
> Not treating these as thread-specific faults seems correct, since we
> never have a way to map these aborts back to the culprit ... except that
> there is a likelihood the culprit is still running when the abort fires.
>
>
> "Spurious" imprecise aborts pretty much always indicate a hardware error
> or a nasty bug somewhere.

I need to find out where the one we are catching is coming from in our
system.

> Another cause is badly implemented, buggy or malicious userspace software
> being given more exotic mmaps that it is qualified to deal with
> responsibly.  That's a nasty bug in the distro maintainer / system
> administrator / vendor.
>
> So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
> We must not encourage people to think that these aborts are somehow
> benign.

Ok, KERN_ERROR or KERN_CRIT sound reasonable.

> If we really want people to fix their bugs, it may be worth considering
> panic(), or doing this when some threshold is reached.  This may be a
> bit harsh though, at least without some threshold.

I was considering also firing a WARN_ON(abort_count++ > 10) or something
similar.

-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2014-02-10 17:25     ` Ben Dooks
@ 2014-02-11 15:43       ` Dave Martin
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Martin @ 2014-02-11 15:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 10, 2014 at 05:25:15PM +0000, Ben Dooks wrote:
> On 10/02/14 14:37, Dave Martin wrote:

[...]

> >"Spurious" imprecise aborts pretty much always indicate a hardware error
> >or a nasty bug somewhere.
> 
> I need to find out where the one we are catching is coming from in our
> system.

Would certainly be good to know...

> >Another cause is badly implemented, buggy or malicious userspace software
> >being given more exotic mmaps that it is qualified to deal with
> >responsibly.  That's a nasty bug in the distro maintainer / system
> >administrator / vendor.
> >
> >So, I think this should be at least KERN_ERROR; maybe KERN_CRIT or above.
> >We must not encourage people to think that these aborts are somehow
> >benign.
> 
> Ok, KERN_ERROR or KERN_CRIT sound reasonable.
> 
> >If we really want people to fix their bugs, it may be worth considering
> >panic(), or doing this when some threshold is reached.  This may be a
> >bit harsh though, at least without some threshold.
> 
> I was considering also firing a WARN_ON(abort_count++ > 10) or something
> similar.

Maybe WARN_ON the first such abort, and ratelimit somehow after?

Being noisy is good -- it's just killing a random, possibly-innocent task
that's not so good.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
@ 2015-04-18  0:14 Ben Hutchings
  2015-04-18  9:08 ` Russell King - ARM Linux
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Hutchings @ 2015-04-18  0:14 UTC (permalink / raw)
  To: linux-arm-kernel

From: Ben Dooks <ben.dooks@codethink.co.uk>

Given that imprecise aborts may be delivered after the action that
caused them (or even for non-cpu related activities such as bridge
faults from a bus-master) it is possible that the wrong process is
terminated as a result.

Add a handler to take and print imprecise aborts and allow the process
to continue. This should ensure that the abort is shown but not kill
the process that was running on the cpu core at the time.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
[bwh: Change the log level and format to match other exceptions;
 set the now-unused sig and code fields for these exceptions to 0]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
---
 arch/arm/mm/fault.c      |   20 ++++++++++++++++++++
 arch/arm/mm/fsr-2level.c |   32 ++++++++++++++++----------------
 arch/arm/mm/fsr-3level.c |    2 +-
 3 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 6333d9c17875..da00fd68fdf4 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -518,6 +518,26 @@ struct fsr_info {
 	const char *name;
 };
 
+static struct fsr_info fsr_info[];
+
+/*
+ * On an imprecise external abort it is possible that the currently running
+ * process did not cause it (it could be from an external bus bridge or
+ * another device causing a fault on the bus).
+ *
+ * Always return handled as we do not know how to do it and killing the
+ * current process may not prevent future faults.
+ */
+static int
+do_bad_imprecise(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
+{
+	const struct fsr_info *inf = fsr_info + fsr_fs(fsr);
+
+	pr_alert("Imprecise abort: %s (0x%03x) at %08lx\n",
+		 inf->name, fsr, addr);
+	return 0;
+}
+
 /* FSR definition */
 #ifdef CONFIG_ARM_LPAE
 #include "fsr-3level.c"
diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c0f341..84f2d77ac150 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -24,22 +24,22 @@ static struct fsr_info fsr_info[] = {
 	 * 10 of the FSR, and may not be recoverable.  These are only
 	 * supported if the CPU abort handler supports bit 10.
 	 */
-	{ do_bad,		SIGBUS,  0,		"unknown 16"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 17"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 18"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 19"			   },
-	{ do_bad,		SIGBUS,  0,		"lock abort"			   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 21"			   },
-	{ do_bad,		SIGBUS,  BUS_OBJERR,	"imprecise external abort"	   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 23"			   },
-	{ do_bad,		SIGBUS,  0,		"dcache parity error"		   }, /* xscale */
-	{ do_bad,		SIGBUS,  0,		"unknown 25"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 26"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 27"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 28"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 29"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 30"			   },
-	{ do_bad,		SIGBUS,  0,		"unknown 31"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 16"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 17"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 18"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 19"			   },
+	{ do_bad_imprecise,	0,	 0,		"lock abort"			   }, /* xscale */
+	{ do_bad_imprecise,	0,	 0,		"unknown 21"			   },
+	{ do_bad_imprecise,	0,	 0,		"imprecise external abort"	   }, /* xscale */
+	{ do_bad_imprecise,	0,	 0,		"unknown 23"			   },
+	{ do_bad_imprecise,	0,	 0,		"dcache parity error"		   }, /* xscale */
+	{ do_bad_imprecise,	0,	 0,		"unknown 25"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 26"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 27"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 28"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 29"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 30"			   },
+	{ do_bad_imprecise,	0,	 0,		"unknown 31"			   },
 };
 
 static struct fsr_info ifsr_info[] = {
diff --git a/arch/arm/mm/fsr-3level.c b/arch/arm/mm/fsr-3level.c
index ab4409a2307e..6eaae6ec6d71 100644
--- a/arch/arm/mm/fsr-3level.c
+++ b/arch/arm/mm/fsr-3level.c
@@ -16,7 +16,7 @@ static struct fsr_info fsr_info[] = {
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
 	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
-	{ do_bad,		SIGBUS,  0,		"asynchronous external abort"	},
+	{ do_bad_imprecise,	0,	 0,		"asynchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
 	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2015-04-18  0:14 [PATCH] ARM: mm: add imprecise abort non-deadly handler Ben Hutchings
@ 2015-04-18  9:08 ` Russell King - ARM Linux
  2015-04-20  8:57   ` Ben Hutchings
  0 siblings, 1 reply; 7+ messages in thread
From: Russell King - ARM Linux @ 2015-04-18  9:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Apr 18, 2015 at 01:14:43AM +0100, Ben Hutchings wrote:
> From: Ben Dooks <ben.dooks@codethink.co.uk>
> 
> Given that imprecise aborts may be delivered after the action that
> caused them (or even for non-cpu related activities such as bridge
> faults from a bus-master) it is possible that the wrong process is
> terminated as a result.
> 
> Add a handler to take and print imprecise aborts and allow the process
> to continue. This should ensure that the abort is shown but not kill
> the process that was running on the cpu core at the time.

I'm not happy with this.

On older CPUs, you generally get the "imprecise" (aka external) aborts
within a few cycles of the faulting instruction, which is good enough
to ensure that the appropriate process gets terminated.

Yes, this is not true of ARMv7, where such faults can happen a longer
time after the access which caused them.

However, I would argue that merely printing a message to the kernel
log is insufficient - especially as the kernel networking layer can
spam the log so that such messages yet rotated out of the ring buffer.

An imprecise abort is a serious condition, one which _ought_ to be very
noticable, on the level of panic()ing the kernel.  It's a data loss
condition, one which _can_ result in corruption of user data or even
filesystems.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] ARM: mm: add imprecise abort non-deadly handler
  2015-04-18  9:08 ` Russell King - ARM Linux
@ 2015-04-20  8:57   ` Ben Hutchings
  0 siblings, 0 replies; 7+ messages in thread
From: Ben Hutchings @ 2015-04-20  8:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2015-04-18 at 10:08 +0100, Russell King - ARM Linux wrote:
> On Sat, Apr 18, 2015 at 01:14:43AM +0100, Ben Hutchings wrote:
> > From: Ben Dooks <ben.dooks@codethink.co.uk>
> > 
> > Given that imprecise aborts may be delivered after the action that
> > caused them (or even for non-cpu related activities such as bridge
> > faults from a bus-master) it is possible that the wrong process is
> > terminated as a result.
> > 
> > Add a handler to take and print imprecise aborts and allow the process
> > to continue. This should ensure that the abort is shown but not kill
> > the process that was running on the cpu core at the time.
> 
> I'm not happy with this.
> 
> On older CPUs, you generally get the "imprecise" (aka external) aborts
> within a few cycles of the faulting instruction, which is good enough
> to ensure that the appropriate process gets terminated.

Right.

> Yes, this is not true of ARMv7, where such faults can happen a longer
> time after the access which caused them.

Not only that, but they can apparently be caused by DMA masters.  I
think Ben ran into a problem like that a while back and this patch dates
from then, but I don't know the details.

If there was a way to tell whose memory access caused it...

> However, I would argue that merely printing a message to the kernel
> log is insufficient - especially as the kernel networking layer can
> spam the log so that such messages yet rotated out of the ring buffer.
> 
> An imprecise abort is a serious condition, one which _ought_ to be very
> noticable, on the level of panic()ing the kernel.  It's a data loss
> condition, one which _can_ result in corruption of user data or even
> filesystems.

Would it make sense to treat this similarly to an x86 Machine Check
Error, and have a kernel parameter to choose between panic and log?  It
seems like that would be useful when debugging a driver that is
triggering this.

Ben.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-04-20  8:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-18  0:14 [PATCH] ARM: mm: add imprecise abort non-deadly handler Ben Hutchings
2015-04-18  9:08 ` Russell King - ARM Linux
2015-04-20  8:57   ` Ben Hutchings
  -- strict thread matches above, loose matches on Subject: below --
2014-02-07 18:20 imprecise abort behaviour Ben Dooks
2014-02-07 18:20 ` [PATCH] ARM: mm: add imprecise abort non-deadly handler Ben Dooks
2014-02-10 14:37   ` Dave Martin
2014-02-10 17:25     ` Ben Dooks
2014-02-11 15:43       ` Dave Martin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).