From: ebiederm@xmission.com (Eric W. Biederman)
To: Dave Martin <Dave.Martin@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Arnd Bergmann <arnd@arndb.de>, Nicolas Pitre <nico@linaro.org>,
	Tony Lindgren <tony@atomide.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Will Deacon <will.deacon@arm.com>,
	Oleg Nesterov <oleg@redhat.com>,
	James Morse <james.morse@arm.com>,
	Olof Johansson <olof@lixom.net>,
	Santosh Shilimkar <santosh.shilimkar@ti.com>,
	linux-arm-kernel@lists.infradead.org,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 07/11] signal/arm64: Document conflicts with SI_USER and SIGFPE, SIGTRAP, SIGBUS
Date: Mon, 15 Jan 2018 11:23:03 -0600
Message-ID: <87h8rnox3c.fsf@xmission.com>
In-Reply-To: <20180115163028.GU22781@e103592.cambridge.arm.com> (Dave Martin's message of "Mon, 15 Jan 2018 16:30:29 +0000")

Dave Martin <Dave.Martin@arm.com> writes:

> On Thu, Jan 11, 2018 at 06:59:36PM -0600, Eric W. Biederman wrote:
>> Setting si_code to 0 results in a userspace seeing an si_code of 0.
>> This is the same si_code as SI_USER.  Posix and common sense requires
>> that SI_USER not be a signal specific si_code.  As such this use of 0
>> for the si_code is a pretty horribly broken ABI.
>
> I think this situation may have come about because 0 is used as a
> padding value for "impossible" cases -- i.e., things that can't happen
> unless the kernel is broken, or things that are too unrecoverable for
> clean error reporting to be helpful.
>
> In general, I think these values are not expected to reach userspace in
> practice.
>
> This is not an excuse though -- and not 100% true -- so it's certainly
> worthy of cleanup.
>
>
> It would be good to approach this similarly for arm and arm64, since
> the arm64 fault code is derived from arm.

In this case the fault_info is something I have only seen on arm64.
I have been approaching all architectures the same way.

If there is insufficient information to fix this class of error without
architecture expertise, I have been adding FPE_FIXME to them.

>> Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
>> value of __SI_KILL and now sees a value of SIL_KILL with the result
>> that uid and pid fields are copied, which might copy the si_addr
>> field by accident but certainly not by design.  Making this a very
>> flakey implementation.
>> 
>> Utilizing FPE_FIXME, BUS_FIXME, TRAP_FIXME siginfo_layout will now return
>> SIL_FAULT and the appropriate fields will be reliably copied.
>> 
>> But folks this is a new and unique kind of bad.  This is massively
>> untested code bad.  This is inventing new and unique ways to get
>> siginfo wrong bad.  This is don't even think about Posix or what
>> siginfo means bad.  This is lots of eyeballs all missing the fact
>> that the code does the wrong thing bad.  This is getting stuck
>> and keep making the same mistake bad.
>> 
>> I really hope we can find a non userspace breaking fix for this on a
>> port as new as arm64.
>
>> Possible ABI fixes include:
>> - Send the signal without siginfo
>> - Don't generate a signal
>
> The above two sound like ABI breaks?

They are ways I have seen code on other platforms deal with not having
enough information to generate siginfo.  Sending the signal without siginfo
is roughly equivalent to your send-SIGKILL suggestion below.

A good example of that is code that calls force_sigsegv.

Calling "force_sig(SIGBUS, current);" is perfectly valid.
And then the parent when it reaped the process would have
a little more information to go on when guessing what happened
to the process.
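
As a rough userspace illustration of what the parent gets to see in that
case, the sketch below forks a child that dies from a signal with the
default disposition and reports what waitpid() returns.  The helper name
sketch_reaped_signal() is hypothetical, and this only models the
parent's view; it is not kernel code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that dies from `sig` with the default
 * disposition and return the signal number the parent learns from
 * waitpid().  Returns -1 on error or if the child did not die from a signal.
 */
static int sketch_reaped_signal(int sig)
{
	struct rlimit no_core = { 0, 0 };
	sigset_t set;
	int status;
	pid_t pid;

	assert(sig > 0);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Avoid leaving a core file behind for core-dumping signals. */
		setrlimit(RLIMIT_CORE, &no_core);

		/* Make sure the signal is neither caught nor blocked. */
		signal(sig, SIG_DFL);
		sigemptyset(&set);
		sigaddset(&set, sig);
		sigprocmask(SIG_UNBLOCK, &set, NULL);

		raise(sig);
		_exit(1);	/* only reached if the signal did not kill us */
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	/* No siginfo survives here: the parent only sees the signal number. */
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling sketch_reaped_signal(SIGBUS) should hand back SIGBUS, which is
the extra hint a parent has when the child was not simply sent SIGKILL.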

>> - Possibly assign and use an appropriate si_code
>> - Don't handle cases which can't happen
>
> I think a mixture of these two is the best approach.
>
> In any case, si_code == 0 here doesn't seem to have any explicit meaning.
> I think we can translate all of the arm64 faults to proper si_codes --
> see my sketch below.  Probably means a bit more thought though.

Yes I would be very happy to see that.

> The only counterargument would be if there is software relying on
> these bogus signal cases getting si_code == 0 for a useful purpose.
>
> The main reason I see to check for SI_USER is to allow a process to
> filter out spurious signals (say, an asynchronous I/O signal for
> which si_value would be garbage), and to print out diagnostics
> before (in the case of a well-behaved program) resetting the signal
> to SIG_DFL and killing itself to report the signal to the waiter.
>
> Daemons may be more discerning about who is allowed to signal them,
> but overloading SIGBUS (say) as an IPC channel sounds like a very odd
> thing to do.  The same probably applies to any signal that has
> nontrivial metadata.

Agreed.  Although I have seen LTP test cases that do crazy things like
that.

> Have you found software that is impacted by this in practice?

No.

I don't expect many userspace applications to look at siginfo, and
everything I have found is some rare, hard-to-trigger non-x86 case, which
limits the exposure to userspace applications tremendously.

The angle I am coming at all of this from is that the Linux kernel code
that filled out struct siginfo was not comprehensible or correct.
Internal to the kernel it was using a magic value (not exportable to
userspace) in the upper bits of si_code.  That was causing problems for
signal injection and converting signals from 32bit to 64bit, and from
64bit to 32bit.

So I wrote kernel/signal.c:siginfo_layout() to figure out which fields
of struct siginfo should be sent to userspace.  In doing so I discovered
that using 0 in si_code (aka SI_USER) is ambiguous, and problematic.
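
To make the ambiguity concrete, here is a deliberately simplified
userspace model of that kind of classification.  It is not the code in
kernel/signal.c: the toy_* names, the two-entry enum and the locally
copied TOY_SI_KERNEL bound are all assumptions made for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

/* Linux's SI_KERNEL value, copied so the sketch does not depend on libc. */
#define TOY_SI_KERNEL 0x80

/* Toy stand-ins for the kernel's layout classes; not the kernel's enum. */
enum toy_layout {
	TOY_SIL_KILL,	/* si_pid and si_uid are the meaningful fields */
	TOY_SIL_FAULT,	/* si_addr is the meaningful field */
};

/* Signals whose signal-specific si_codes describe a fault at si_addr. */
static int toy_is_fault_signal(int sig)
{
	return sig == SIGILL || sig == SIGFPE || sig == SIGSEGV ||
	       sig == SIGBUS || sig == SIGTRAP;
}

static enum toy_layout toy_siginfo_layout(int sig, int si_code)
{
	assert(sig > 0);

	/*
	 * Signal-specific codes are small positive numbers.  SI_USER is 0,
	 * so a fault reported with si_code 0 cannot be told apart from a
	 * signal sent by kill() and falls through to the KILL layout.
	 */
	if (toy_is_fault_signal(sig) &&
	    si_code > SI_USER && si_code < TOY_SI_KERNEL)
		return TOY_SIL_FAULT;

	return TOY_SIL_KILL;
}
```

In this model a SIGBUS with BUS_ADRALN is classified as a fault, while
a SIGBUS with si_code 0 is classified like a kill(), so si_addr would
not be the field that gets copied.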

Unfortunately in most of the cases I have spotted, using 0 in the si_code
requires architectural knowledge that I don't currently have to sort
out.  So the best I can do is change si_code from 0 to
FPE_FIXME/BUS_FIXME/TRAP_FIXME and bring the architecture maintainers'
attention to this area.

One of the problems that results from all of this is that we copy
uninitialized data to userspace.   I am slowly unifying and cleaning the
code up so that the code is simple enough we can be certain we are
not copying uninitialized data to userspace.

With si_codes of FPE_FIXME/BUS_FIXME/TRAP_FIXME I can at least attempt to
keep the craziness from happening.

My next step is to unify struct siginfo and struct compat_siginfo
and the functions that copy them to userspace because there are very
significant problems there.


All of that said I like the way you are thinking about fixing these
issues.

> [...]
>
>> +++ b/arch/arm64/kernel/fpsimd.c
>> @@ -867,7 +867,7 @@ asmlinkage void do_fpsimd_acc(unsigned int esr, struct pt_regs *regs)
>>  asmlinkage void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs)
>>  {
>>  	siginfo_t info;
>> -	unsigned int si_code = 0;
>> +	unsigned int si_code = FPE_FIXME;
>>  
>>  	if (esr & FPEXC_IOF)
>>  		si_code = FPE_FLTINV;
>
> This 0 can happen for vector operations where the implementation may
> not be able to report exactly what happened, for example where
> the implementer didn't want to pay the cost of tracking exactly
> what went wrong in each lane.
>
> However, the FPEXC_* bits can be garbage in such a case rather
> than being all zero: we should be checking the TFV bit in the ESR here.
> This may be a bug.
>
> Perhaps FPE_FLTINV should be returned in si_code for such cases (it's
> not otherwise used on arm64 -- invalid instructions would be reported as
> SIGILL/ILL_ILLOPC instead).
>
> Otherwise, we might want to define a new code or arbitrarily pick
> one of the existing FLT_* since this is really a more benign condition
> than executing an illegal instruction.  Alternatively, treat the
> fault as spurious and suppress it, but that doesn't feel right either.

I would love to see this sorted out.  There is a very similar pattern
on several different architectures.  I suspect if we have a clean
solution on one architecture the other architectures will be able to use
that solution as well.
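
One possible shape for such a solution, sketched in userspace C: only
trust the per-exception bits when the "valid" bit is set, and otherwise
return a caller-chosen fallback that is never 0.  The SKETCH_* bit
positions and the function name are illustrative assumptions that
would need checking against the architecture documentation, and which
fallback si_code is right is exactly the open question above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

/* Illustrative bit positions; verify against the architecture docs. */
#define SKETCH_ESR_TFV		(1u << 23)	/* exception bits are valid */
#define SKETCH_FPEXC_IOF	(1u << 0)	/* invalid operation */
#define SKETCH_FPEXC_DZF	(1u << 1)	/* divide by zero */
#define SKETCH_FPEXC_OFF	(1u << 2)	/* overflow */
#define SKETCH_FPEXC_UFF	(1u << 3)	/* underflow */
#define SKETCH_FPEXC_IXF	(1u << 4)	/* inexact */

static int sketch_fpe_si_code(unsigned int esr, int fallback)
{
	/* 0 is SI_USER, so it must never be used as the fallback. */
	assert(fallback != 0);

	/* Without the valid bit the exception bits may be garbage. */
	if (!(esr & SKETCH_ESR_TFV))
		return fallback;

	if (esr & SKETCH_FPEXC_IOF)
		return FPE_FLTINV;
	if (esr & SKETCH_FPEXC_DZF)
		return FPE_FLTDIV;
	if (esr & SKETCH_FPEXC_OFF)
		return FPE_FLTOVF;
	if (esr & SKETCH_FPEXC_UFF)
		return FPE_FLTUND;
	if (esr & SKETCH_FPEXC_IXF)
		return FPE_FLTRES;

	/* Valid bit set but no known exception bit: still not SI_USER. */
	return fallback;
}
```

Taking the fallback as a parameter keeps the sketch neutral about
whether FPE_FLTINV, a new code, or something else ends up being used.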

>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 9b7f89df49db..abe200587334 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -596,7 +596,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>  
>>  	info.si_signo = SIGBUS;
>>  	info.si_errno = 0;
>> -	info.si_code  = 0;
>> +	info.si_code  = BUS_FIXME;
>
> Probably BUS_OBJERR.
>
>>  	if (esr & ESR_ELx_FnV)
>>  		info.si_addr = NULL;
>>  	else
>> @@ -607,70 +607,70 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>  }
>>  
>>  static const struct fault_info fault_info[] = {
>> -	{ do_bad,		SIGBUS,  0,		"ttbr address size fault"	},
>> -	{ do_bad,		SIGBUS,  0,		"level 1 address size fault"	},
>> -	{ do_bad,		SIGBUS,  0,		"level 2 address size fault"	},
>> -	{ do_bad,		SIGBUS,  0,		"level 3 address size fault"	},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"ttbr address size fault"	},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"level 1 address size fault"	},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"level 2 address size fault"	},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"level 3 address size fault"	},
>
> Pagetable screwup or kernel/system/CPU bug -> SIGKILL, or panic().
>
> [...]
>
>> -	{ do_bad,		SIGBUS,  0,		"unknown 8"			},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"unknown 8"			},
>
> [...]
>
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"unknown 12"			},
>
> Not architected, so they could mean absolutely anything.  If they
> can happen at all, they are probably unsafe to ignore.
>
>  -> SIGKILL, or panic().
>
> Similarly for all the "unknown" codes in the table, which I omit for
> brevity.
>
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"synchronous external abort"	},
>
> This si_code seems to be a fallback for if ACPI is absent or doesn't
> know what to do with this error.
>
> -> SIGBUS/BUS_OBJERR?
>
> Can probably legitimately happen for userspace for suitable MMIO mappings.
>
> Perhaps it's more serious though in the presence of ACPI.  Do we expect
> that ACPI can diagnose all localisable errors?
>
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 0 (translation table walk)"	},
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 1 (translation table walk)"	},
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 2 (translation table walk)"	},
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 3 (translation table walk)"	},
>
> Pagetable screwup or kernel/system/CPU bug -> SIGKILL, or panic().
>
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"synchronous parity or ECC error" },	// Reserved when RAS is implemented
>
> Possibly SIGBUS/BUS_MCEERR_AR (though I don't know exactly what
> userspace is supposed to do with this or whether this implies the
> existence of certain kernel features for managing the error that
> may not be present on arm64...)
>
> Otherwise, SIGKILL.

Yes.  I don't quite understand the AR (Action Required) and AO (Action
Optional) bits, but BUS_MCEERR_AR does sound like a good fit.


>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 0 synchronous parity error (translation table walk)"	},	// Reserved when RAS is implemented
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 1 synchronous parity error (translation table walk)"	},	// Reserved when RAS is implemented
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 2 synchronous parity error (translation table walk)"	},	// Reserved when RAS is implemented
>> +	{ do_sea,		SIGBUS,  BUS_FIXME,	"level 3 synchronous parity error (translation table walk)"	},	// Reserved when RAS is implemented
>
> Process page tables corrupt: if the kernel couldn't fix this, the
> process can't reasonably fix it -> SIGKILL
>
> Since this is a RAS-type error it could be triggered by a cosmic ray
> rather than requiring a kernel or system bug or other major failure, so
> we probably shouldn't panic the system if the error is localisable to a
> particular process.
>
>>  	{ do_alignment_fault,	SIGBUS,  BUS_ADRALN,	"alignment fault"		},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"TLB conflict abort"		},
>
> Broken kernel, kernel memory corruption, CPU/system bug etc.:
> SIGKILL or panic().
>
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"Unsupported atomic hardware update fault"	},
>
> Broken kernel, kernel memory corruption, CPU/system bug etc.:
> SIGKILL or panic().
>
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"implementation fault (lockdown abort)" },
>
> Userspace shouldn't have access to lockdown: kernel/system bug
> -> SIGKILL or panic().
>
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"implementation fault (unsupported exclusive)" },
>
> If this fault can happen on an implementation because an exclusive
> load/store issued by userspace may fail somewhere in the memory system,
> this should probably be SIGBUS/BUS_OBJERR (or possibly a new BUS_* code).
>
> This one may need to be hardware-dependent, if this fault can mean
> something different depending on the hardware (I'm guessing this
> possibility from "implementation" -- I've not checked the docs.)
>
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"section domain fault"		},
>> +	{ do_bad,		SIGBUS,  BUS_FIXME,	"page domain fault"		},
>
> Broken kernel, kernel memory corruption, CPU/system bug etc.:
> SIGKILL or panic().
>
>>  };
>>  
>>  int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>> @@ -739,11 +739,11 @@ static struct fault_info __refdata debug_fault_info[] = {
>> +	{ do_bad,	SIGBUS,		BUS_FIXME,	"unknown 3"		},
>> +	{ do_bad,	SIGTRAP,	TRAP_FIXME,	"aarch32 vector catch"	},
>> +	{ do_bad,	SIGBUS,		BUS_FIXME,	"unknown 7"		},
>>  };
>
> Impossible (?), or meaning unknown.
> SIGKILL/panic() for these?  Or possibly (since these are probably well
> localised errors) SIGILL/ILL_ILLOPC.

I like the way you are thinking on these, and I'd love to see them
fixed.
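
For illustration only, a few of the suggestions above could be written
down as a table plus a check that no entry reuses 0 (SI_USER) as its
si_code.  This is a userspace sketch, not the arm64 fault_info table:
the struct, the names, the choice of entries, and the use of the
SI_KERNEL value for the SIGKILL rows are all assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Linux's SI_KERNEL value, copied so the sketch does not depend on libc. */
#define SKETCH_SI_KERNEL 0x80

struct sketch_fault {
	int sig;
	int code;
	const char *name;
};

/* A handful of entries following the suggestions in the review above. */
static const struct sketch_fault sketch_faults[] = {
	{ SIGKILL, SKETCH_SI_KERNEL, "level 1 address size fault" },
	{ SIGBUS,  BUS_OBJERR,       "synchronous external abort" },
	{ SIGBUS,  BUS_ADRALN,       "alignment fault" },
	{ SIGKILL, SKETCH_SI_KERNEL, "TLB conflict abort" },
};

/* Count entries whose si_code userspace would read as SI_USER. */
static size_t sketch_si_user_conflicts(const struct sketch_fault *tbl, size_t n)
{
	size_t i, bad = 0;

	assert(tbl != NULL);

	for (i = 0; i < n; i++)
		if (tbl[i].code == SI_USER)
			bad++;

	return bad;
}
```

A check like this passes for the table above and flags any row that
still carries the old si_code of 0.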

Eric

2018-01-16  0:40     ` [PATCH 20/22] signal: Unify and correct copy_siginfo_from_user32 Eric W. Biederman
2018-01-16  0:40     ` [PATCH 21/22] signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32 Eric W. Biederman
2018-01-16  0:40     ` [PATCH 22/22] signal: Unify and correct copy_siginfo_to_user32 Eric W. Biederman
2018-01-19 18:03       ` Al Viro
2018-01-19 21:04         ` Eric W. Biederman
2018-01-23 21:05     ` [PATCH 00/10] siginfo infrastructure Eric W. Biederman
2018-01-23 21:05       ` Eric W. Biederman
2018-01-23 21:07       ` [PATCH 01/10] ptrace: Use copy_siginfo in setsiginfo and getsiginfo Eric W. Biederman
2018-01-23 21:07       ` [PATCH 02/10] signal/arm64: Better isolate the COMPAT_TASK portion of ptrace_hbptriggered Eric W. Biederman
2018-01-23 21:07       ` [PATCH 03/10] signal: Don't use structure initializers for struct siginfo Eric W. Biederman
2018-01-23 21:07       ` [PATCH 04/10] signal: Replace memset(info,...) with clear_siginfo for clarity Eric W. Biederman
2018-01-23 21:07       ` [PATCH 05/10] signal: Add send_sig_fault and force_sig_fault Eric W. Biederman
2018-01-23 21:07       ` [PATCH 06/10] signal: Helpers for faults with specialized siginfo layouts Eric W. Biederman
2018-01-24 19:26         ` Ram Pai
2018-01-24 20:54           ` Eric W. Biederman
2018-01-23 21:07       ` [PATCH 07/10] signal/powerpc: Remove unnecessary signal_code parameter of do_send_trap Eric W. Biederman
2018-01-23 21:07       ` [PATCH 08/10] signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed Eric W. Biederman
2018-01-23 21:07       ` [PATCH 09/10] mm/memory_failure: Remove unused trapno from memory_failure Eric W. Biederman
2018-01-23 21:07       ` [PATCH 10/10] signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr Eric W. Biederman
