* SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Joakim Tjernlund @ 2022-03-24 12:10 UTC
To: linux-arm-kernel@lists.infradead.org

We have a custom SoC (Cortex-A53 CPU). When an app accesses nonexistent address space, the kernel reports:

# > devmem 0x20000000 w 0x1000 #this will open /dev/mem and write

[ 37.570886] SError Interrupt on CPU0, code 0xbf000000 -- SError
[ 37.571974] CPU: 0 PID: 72 Comm: devmem Not tainted 5.15.26-g18447c6fff6f-dirty #26
[ 37.573150] Hardware name: infinera,xr (DT)
[ 37.573599] pstate: 60000010 (nZCv q A32 LE aif -DIT -SSBS)
[ 37.574705] pc : 000000000098775c
[ 37.575063] lr : 0000000000986918
[ 37.575392] sp : 00000000ffd140a8
[ 37.575725] x12: 0000000000a36c10
[ 37.576443] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000020
[ 37.577872] x8 : 00000000ffd141c0 x7 : 00000000ffd14104 x6 : 0000000000986c9c
[ 37.579278] x5 : 000000000000001f x4 : 0000000000000004 x3 : 0000000000a37020
[ 37.580635] x2 : 0000000000000003 x1 : 0000000000001000 x0 : 0000000000000000
[ 37.582164] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 37.582685] Kernel Offset: disabled
[ 37.582932] CPU features: 0x00001001,20000842
[ 37.583509] Memory Limit: none
[ 37.630058] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

and the kernel panics. This is a surprise, as I expected the app to just be killed by a SIGBUS.
Is this what to expect?
I see that the kernel looks for the RAS extension, but we don't have that.

Can anything be done not to panic the kernel for such accesses?
Can one build some sort of blacklisted address spaces which the MMU will block?

 Jocke
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
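For readers following along, the two hex values in the log can be decoded by hand. The sketch below is not from the thread; it assumes the Arm architecture's ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and for an SError the IDS flag in ISS bit 24) and the saved-PSTATE layout for an exception taken from AArch32 (NZCV in bits [31:28], bit 4 set for AArch32, mode in bits [3:0]). The function names `decode_esr` and `decode_pstate` are made up for illustration.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into its top-level fields (assumed layout)."""
    ec = (esr >> 26) & 0x3F   # exception class, bits [31:26]
    il = (esr >> 25) & 0x1    # instruction length bit, bit [25]
    iss = esr & 0x1FFFFFF     # instruction specific syndrome, bits [24:0]
    fields = {"ec": ec, "il": il, "iss": iss}
    if ec == 0x2F:            # 0x2F is the SError interrupt class
        # IDS = 1 means the rest of the syndrome is implementation defined
        fields["ids"] = (iss >> 24) & 0x1
    return fields


def decode_pstate(pstate: int) -> dict:
    """Pick out the flags the kernel prints as 'nZCv ... A32'."""
    return {
        "n": (pstate >> 31) & 1,
        "z": (pstate >> 30) & 1,
        "c": (pstate >> 29) & 1,
        "v": (pstate >> 28) & 1,
        "aarch32": (pstate >> 4) & 1,  # 1 = exception taken from AArch32
        "mode": pstate & 0xF,          # 0 = User/EL0
    }
```

Under those assumptions, `decode_esr(0xbf000000)` gives EC 0x2F with IDS set, i.e. an SError whose syndrome carries no architecturally defined detail, and `decode_pstate(0x60000010)` matches the "nZCv ... A32" annotation in the log: the fault was taken while a 32-bit user task was running.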
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Robin Murphy @ 2022-03-24 13:16 UTC
To: Joakim Tjernlund, linux-arm-kernel@lists.infradead.org

On 2022-03-24 12:10, Joakim Tjernlund wrote:
> [ 37.570886] SError Interrupt on CPU0, code 0xbf000000 -- SError
> [...]
> [ 37.582164] Kernel panic - not syncing: Asynchronous SError Interrupt
> [...]
>
> and the kernel panics. This is a surprise as I expected the app to just be killed by a SIGBUS.
> Is this what to expect?
> I see that kernel looks for the RAS extension but we don't have that.
>
> Can anything be done not to panic the kernel for such accesses?

No. The error comes back to the CPU in an unattributable manner, so all
it knows is that *something*, at some point in the past, went
catastrophically wrong. Saying "this is fine..." and carrying on
regardless isn't really viable. IIRC the RAS extension places
constraints on the delivery of async SError such that it's slightly more
possible to do something with, but without that all bets are off.

> Can one build some sort of blacklisted address spaces which the MMU will block?

Sure, just configure the kernel with CONFIG_DEVMEM=n and it should never
access anything invalid.

I'm not even entirely joking there - even for address ranges that the
kernel *does* know about, you can still SError or deadlock by poking at
something that's currently clock-gated or powered off, or lose coherency
and cause corruption by accessing memory with the wrong attributes; at
worst writing the wrong thing to the wrong place may even physically
damage the hardware.

Robin.
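As a small aid to the CONFIG_DEVMEM=n suggestion, here is a sketch of how one might check which /dev/mem-related options a given kernel config has. It is not from the thread: the helper name `devmem_options` and the sample config text are made up, and CONFIG_STRICT_DEVMEM and CONFIG_IO_STRICT_DEVMEM are additional upstream Kconfig symbols that nobody in this discussion mentioned. The strict variants only narrow what /dev/mem may touch; they do not address the other hazards described above.

```python
import re

# Options of interest; only CONFIG_DEVMEM is named in the thread.
WANTED = ("CONFIG_DEVMEM", "CONFIG_STRICT_DEVMEM", "CONFIG_IO_STRICT_DEVMEM")


def devmem_options(config_text: str) -> dict:
    """Report the state of /dev/mem-related options in a kernel .config."""
    state = {name: "absent" for name in WANTED}
    for raw in config_text.splitlines():
        line = raw.strip()
        enabled = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if enabled and enabled.group(1) in state:
            state[enabled.group(1)] = enabled.group(2)
            continue
        # Disabled options appear as a comment in generated .config files.
        disabled = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if disabled and disabled.group(1) in state:
            state[disabled.group(1)] = "n"
    return state


# Hypothetical sample input, not taken from the poster's system.
SAMPLE = """\
CONFIG_ARM64=y
# CONFIG_DEVMEM is not set
CONFIG_STRICT_DEVMEM=y
"""
```

With the sample text, `devmem_options(SAMPLE)` reports CONFIG_DEVMEM as "n", CONFIG_STRICT_DEVMEM as "y" and CONFIG_IO_STRICT_DEVMEM as "absent". On a running system the config text is often available from /proc/config.gz, provided the kernel was built to expose it.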
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Joakim Tjernlund @ 2022-03-24 14:01 UTC
To: robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org

On Thu, 2022-03-24 at 13:16 +0000, Robin Murphy wrote:
> On 2022-03-24 12:10, Joakim Tjernlund wrote:
> > [...]
> > Can anything be done not to panic the kernel for such accesses?
>
> No. The error comes back to the CPU in an unattributable manner, so all
> it knows is that *something*, at some point in the past, went
> catastrophically wrong. Saying "this is fine..." and carrying on
> regardless isn't really viable. IIRC the RAS extension places
> constraints on the delivery of async SError such that it's slightly more
> possible to do something with, but without that all bets are off.

And this is because we don't have RAS? If we did have RAS, would/could the kernel sort out the error,
and would the app get a SIGBUS or similar?

> > Can one build some sort of blacklisted address spaces which the MMU will block?
>
> Sure, just configure the kernel with CONFIG_DEVMEM=n and it should never
> access anything invalid.
> I'm not even entirely joking there - even for address ranges that the
> kernel *does* know about, you can still SError or deadlock by poking at
> something that's currently clock-gated or powered off, or lose coherency
> and cause corruption by accessing memory with the wrong attributes; at
> worst writing the wrong thing to the wrong place may even physically
> damage the hardware.

I know /dev/mem is bad, and it was only an example, but such SW errors can happen elsewhere too;
we got one from a badly configured UIO device as well. HW errors we just have to live with, but
I hoped we could handle some SW errors better.

 Jocke
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Marc Zyngier @ 2022-03-24 14:17 UTC
To: Joakim Tjernlund
Cc: robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org

On Thu, 24 Mar 2022 14:01:53 +0000,
Joakim Tjernlund <Joakim.Tjernlund@infinera.com> wrote:
> [...]
> And this is because we don't have RAS? If we did have RAS
> would/could kernel sort out the error and the app would get a
> SIGBUS or similar?

With RAS, the error would be containable, and attributed to the
userspace task by the kernel on the next exception. Without RAS, panic
is the only option, as we have no idea what the damage is. The machine
is on fire, for all we know.

> I know /dev/mem is bad and it was an example but such SW errors can
> happen elsewhere too, we got one from a badly configured UIO device
> as well. HW errors we just have to live with but I hoped we could
> handle some SW errors better.

I think you have the wrong end of the stick here. This *is* a HW
error, and the HW tells you in no uncertain terms that something is
really bad.

If the device is supposed to be assignable to userspace, it either
must be designed not to respond with a SError no matter what userspace
is throwing at it (because let's face it, userspace will eventually do
something really bad), or the whole system must be designed in a way
that such error can be contained and attributed to the offending
party.

Just giving userspace any odd device and hoping that it will all be
fine is unfortunately wishful thinking.

	M.

-- 
Without deviation from the norm, progress is not possible.
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Joakim Tjernlund @ 2022-03-24 14:50 UTC
To: maz@kernel.org
Cc: robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org

On Thu, 2022-03-24 at 14:17 +0000, Marc Zyngier wrote:
> [...]
> With RAS, the error would be containable, and attributed to the
> userspace task by the kernel on the next exception. Without RAS, panic
> is the only option, as we have no idea what the damage is. The machine
> is on fire, for all we know.

Thanks, now I know.

> I think you have the wrong end of the stick here. This *is* a HW
> error, and the HW tells you so in no uncertain terms that something is
> really bad.

Yes, "SW-induced HW error" is a better description.

> If the device is supposed to be assignable to userspace, it either
> must be designed not to respond with a SError no matter what userspace
> is throwing at it (because let's face it, userspace will eventually do
> something really bad), or the whole system must be designed in a way
> that such error can be contained and attributed to the offending
> party.
>
> Just giving userspace any odd device and hoping that it will all be
> fine is unfortunately wishful thinking.

Sure, I just want to limit the damage where I can. A pointer access to nonexistent address space
is not really harmful, and I want the app to take the hit for it. At least then you can
log and troubleshoot more easily.

 Jocke
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Robin Murphy @ 2022-03-24 15:05 UTC
To: Joakim Tjernlund, maz@kernel.org
Cc: linux-arm-kernel@lists.infradead.org

On 2022-03-24 14:50, Joakim Tjernlund wrote:
> On Thu, 2022-03-24 at 14:17 +0000, Marc Zyngier wrote:
>> [...]
>> Just giving userspace any odd device and hoping that it will all be
>> fine is unfortunately wishful thinking.
>
> Sure, just want to limit the damage where I can. A ptr access to non existing space is not really harmful

Well, except when it is... try that on a Qualcomm SoC where the EL2
firmware will trap you and reset the system before you even know you've
done anything wrong. If you know enough to know that an error triggered
by accessing some address is truly benign, you know enough to avoid
making that access in the first place.

Robin.

> and I want the app to take the hit for it. At least then you can log/trouble shoot easier.
>
> Jocke
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Joakim Tjernlund @ 2022-03-24 15:11 UTC
To: robin.murphy@arm.com, maz@kernel.org
Cc: linux-arm-kernel@lists.infradead.org

On Thu, 2022-03-24 at 15:05 +0000, Robin Murphy wrote:
> On 2022-03-24 14:50, Joakim Tjernlund wrote:
> > [...]
> > Sure, just want to limit the damage where I can. A ptr access to non existing space is not really harmful
>
> Well, except when it is... try that on a Qualcomm SoC where the EL2
> firmware will trap you and reset the system before you even know you've
> done anything wrong. If you know enough to know that an error triggered
> by accessing some address is truly benign, you know enough to avoid
> making that access in the first place.

Of course the error will be dealt with, but why make bug finding harder than it has to be?

 Jocke
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
From: Marc Zyngier @ 2022-03-24 15:25 UTC
To: Joakim Tjernlund
Cc: robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org

On Thu, 24 Mar 2022 15:11:42 +0000,
Joakim Tjernlund <Joakim.Tjernlund@infinera.com> wrote:
>
> On Thu, 2022-03-24 at 15:05 +0000, Robin Murphy wrote:
> > Well, except when it is... try that on a Qualcomm SoC where the EL2
> > firmware will trap you and reset the system before you even know you've
> > done anything wrong. If you know enough to know that an error triggered
> > by accessing some address is truly benign, you know enough to avoid
> > making that access in the first place.
>
> of course the error will be dealt with but why make bug finding
> harder than it has to be?

Maybe that was not clear enough from our earlier replies. Let me try
again.

There is *nothing* more the kernel can do. We don't even know what
caused the access (read, write, earthquake or foreign power invasion).

By the time we get the SError interrupt, we could well be running
something altogether different because all of that is totally
asynchronous *by nature*. You're just lucky that you get the response
quickly enough that the kernel is still running the offending
userspace.

	M.

-- 
Without deviation from the norm, progress is not possible.
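The point about asynchrony can be illustrated with a deliberately simplified toy model. Nothing here reflects how a real interconnect or the arm64 kernel is implemented; the function `run`, the task names and the tick-based latency are all hypothetical. Each tick one task runs and may issue a write; a write to a bad address produces an error report a fixed number of ticks later, and the only thing the "kernel" can observe at that moment is which task happens to be running.

```python
from collections import deque


def run(schedule, bad_addresses, latency):
    """Toy model of an asynchronous bus error.

    schedule: list of (task_name, address_or_None), one entry per tick.
    Returns a list of (tick, task_running_at_delivery, task_that_wrote).
    The last element is known only to the model, never to the kernel.
    """
    in_flight = deque()  # (delivery_tick, culprit) in issue order
    reports = []
    for tick, (task, addr) in enumerate(schedule):
        # Deliver every error whose latency has elapsed by this tick.
        while in_flight and in_flight[0][0] <= tick:
            _, culprit = in_flight.popleft()
            reports.append((tick, task, culprit))
        # A posted write completes from the CPU's view immediately;
        # the error, if any, only comes back later.
        if addr is not None and addr in bad_addresses:
            in_flight.append((tick + latency, task))
    return reports


# Hypothetical schedule: devmem writes once, then another task is scheduled.
SCHEDULE = [
    ("devmem", 0x20000000),
    ("devmem", None),
    ("sshd", None),
    ("sshd", None),
]
```

With a latency of one tick the error arrives while `devmem` is still running, which is the lucky case seen in the log. With a latency of two ticks the same write is reported while `sshd` is running, so blaming the current task would kill the wrong process.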
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic 2022-03-24 15:25 ` Marc Zyngier @ 2022-03-24 15:42 ` Joakim Tjernlund 2022-03-24 15:54 ` Robin Murphy 0 siblings, 1 reply; 10+ messages in thread From: Joakim Tjernlund @ 2022-03-24 15:42 UTC (permalink / raw) To: maz@kernel.org; +Cc: robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org On Thu, 2022-03-24 at 15:25 +0000, Marc Zyngier wrote: > On Thu, 24 Mar 2022 15:11:42 +0000, > Joakim Tjernlund <Joakim.Tjernlund@infinera.com> wrote: > > > > On Thu, 2022-03-24 at 15:05 +0000, Robin Murphy wrote: > > > > Well, except when it is... try that on a Qualcomm SoC where the EL2 > > > firmware will trap you and reset the system before you even know you've > > > done anything wrong. If you know enough to know that an error triggered > > > by accessing some address is truly benign, you know enough to avoid > > > making that access in the first place. > > > > of course the error will be dealt with but why make bug finding > > harder than it has to be? > > Maybe that was not clear enough from our earlier replies. Let me try > again. > > There is *nothing* more the kernel can do. We don't even know what > caused the access (read, write, earthquake or foreign power invasion). > > By the time we get the SError interrupt, we could well be running > something altogether different because all of that is totally > asynchronous *by nature*. You're just lucky that you get the response > quickly enough that the kernel is still running the offending > userspace. I worked ppc earlier and there am used to get an exception(MachineCheck) with PC and Data address for similar cases and can usually pass that on to user space as a SIGBUS and kernel moves along. Seems ARM works very differently and pulls the plug directly, just finding it odd though. 
 Jocke
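Editorial note: for contrast, here is a minimal sketch of the synchronous case that the message above expects, where a fault can be attributed to one process and only that process dies. It is a generic Linux illustration, not a reproduction of the A53 behaviour in this thread. It assumes a Linux host where touching a file-backed mapping beyond the end of the file raises SIGBUS. The child script and the names `CHILD` and `killed_by_sigbus` are made up for the example.

```python
import signal
import subprocess
import sys
import textwrap

# Child process: map one page of a temporary file, shrink the file to
# zero bytes, then touch the mapping. The page is no longer backed by
# the file, so the access faults synchronously in this process.
CHILD = textwrap.dedent("""
    import mmap, os, tempfile
    fd, path = tempfile.mkstemp()
    os.unlink(path)
    os.write(fd, b"x" * mmap.PAGESIZE)
    m = mmap.mmap(fd, mmap.PAGESIZE)
    os.ftruncate(fd, 0)
    m[0]
""")

result = subprocess.run([sys.executable, "-c", CHILD])

# A negative return code means the child was terminated by that signal.
killed_by_sigbus = result.returncode == -signal.SIGBUS
print("child killed by SIGBUS:", killed_by_sigbus)
```

The parent keeps running because the fault is raised on the exact instruction that made the access, so the kernel knows which task to signal. An asynchronous SError gives no such link between the error and an instruction or task.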
* Re: SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
  2022-03-24 15:42 ` Joakim Tjernlund
@ 2022-03-24 15:54 ` Robin Murphy
  0 siblings, 0 replies; 10+ messages in thread
From: Robin Murphy @ 2022-03-24 15:54 UTC
To: Joakim Tjernlund, maz@kernel.org; +Cc: linux-arm-kernel@lists.infradead.org

On 2022-03-24 15:42, Joakim Tjernlund wrote:
> On Thu, 2022-03-24 at 15:25 +0000, Marc Zyngier wrote:
>> On Thu, 24 Mar 2022 15:11:42 +0000,
>> Joakim Tjernlund <Joakim.Tjernlund@infinera.com> wrote:
>>>
>>> On Thu, 2022-03-24 at 15:05 +0000, Robin Murphy wrote:
>>
>>>> Well, except when it is... try that on a Qualcomm SoC where the EL2
>>>> firmware will trap you and reset the system before you even know you've
>>>> done anything wrong. If you know enough to know that an error triggered
>>>> by accessing some address is truly benign, you know enough to avoid
>>>> making that access in the first place.
>>>
>>> of course the error will be dealt with but why make bug finding
>>> harder than it has to be?
>>
>> Maybe that was not clear enough from our earlier replies. Let me try
>> again.
>>
>> There is *nothing* more the kernel can do. We don't even know what
>> caused the access (read, write, earthquake or foreign power invasion).
>>
>> By the time we get the SError interrupt, we could well be running
>> something altogether different because all of that is totally
>> asynchronous *by nature*. You're just lucky that you get the response
>> quickly enough that the kernel is still running the offending
>> userspace.
>
> I worked ppc earlier and there am used to get an exception(MachineCheck) with PC and Data address
> for similar cases and can usually pass that on to user space as a SIGBUS and kernel moves along.
>
> Seems ARM works very differently and pulls the plug directly, just finding it odd though.
Linux necessarily has to operate within the bounds of the architecture
on which it's running, while you as an external observer of the entire
system do not. If you find it inconvenient that Linux handles an
unattributable error by not attributing it to the cause that your
higher-level comparatively omnipotent knowledge can, and you are
confident that on *your* system in *your* debugging scenario, there are
no other possible sources of unattributable errors, then feel free to
hack Linux locally to not panic on an unattributable error. Just
understand why it's a local hack and you won't be sending a patch upstream.

Robin.
end of thread

Thread overview: 10+ messages
2022-03-24 12:10 SError Interrupt on CPU0, code 0xbf000000 makes kernel panic Joakim Tjernlund
2022-03-24 13:16 ` Robin Murphy
2022-03-24 14:01   ` Joakim Tjernlund
2022-03-24 14:17     ` Marc Zyngier
2022-03-24 14:50       ` Joakim Tjernlund
2022-03-24 15:05         ` Robin Murphy
2022-03-24 15:11           ` Joakim Tjernlund
2022-03-24 15:25             ` Marc Zyngier
2022-03-24 15:42               ` Joakim Tjernlund
2022-03-24 15:54                 ` Robin Murphy