Embedded Linux development
* Re: File system robustness
From: Alan C. Assis @ 2023-07-18 13:04 UTC (permalink / raw)
  To: Bjørn Forsman
  Cc: Kai Tomerius, linux-embedded, Ext4 Developers List, dm-devel
In-Reply-To: <CAEYzJUGC8Yj1dQGsLADT+pB-mkac0TAC-typAORtX7SQ1kVt+g@mail.gmail.com>

Hi Bjørn,

On 7/18/23, Bjørn Forsman <bjorn.forsman@gmail.com> wrote:
> On Tue, 18 Jul 2023 at 08:03, Kai Tomerius <kai@tomerius.de> wrote:
>> I should have mentioned that I'll have a large NAND flash, so ext4
>> might still be the file system of choice. The other ones you mentioned
>> are interesting to consider, but seem to be more fitting for a smaller
>> NOR flash.
>
> If you mean raw NAND flash I would think UBIFS is still the way to go?
> (It's been several years since I was into embedded Linux systems.)
>
> https://elinux.org/images/0/02/Filesystem_Considerations_for_Embedded_Devices.pdf
> is focused on eMMC/SD Cards, which have built-in controllers that
> enable them to present a block device interface, which is very unlike
> what raw NAND devices have.
>
> Please see https://www.kernel.org/doc/html/latest/filesystems/ubifs.html
> for more info.
>

You are right, for NAND there is an old (but gold) presentation here:

https://elinux.org/images/7/7e/ELC2009-FlashFS-Toshiba.pdf

UBIFS and YAFFS2 are the way to go.

But please note that YAFFS2 needs license payment for commercial
application (something that I only discovered recently when Xiaomi
integrated it into NuttX mainline, bad surprise).

BR,

Alan


* Re: File system robustness
From: Chris @ 2023-07-18 14:47 UTC (permalink / raw)
  To: Alan C. Assis, Bjørn Forsman
  Cc: dm-devel, Ext4 Developers List, linux-embedded, Kai Tomerius
In-Reply-To: <CAG4Y6eTN1XbZ_jAdX+t2mkEN=KoNOqprrCqtX0BVfaH6AxkdtQ@mail.gmail.com>



Hi Bjørn,

If I may summarize, for Linux with raw NAND flash, your main option is UBIFS. You can also use UBI + squashfs if you really want to save space.

For Linux with managed flash (e.g. eMMC or UFS), most people go with EXT4 or F2FS.

HTH,
Chris

On 18 July 2023 14:04:55 BST, "Alan C. Assis" <acassis@gmail.com> wrote:
>Hi Bjørn,
>
>On 7/18/23, Bjørn Forsman <bjorn.forsman@gmail.com> wrote:
>> On Tue, 18 Jul 2023 at 08:03, Kai Tomerius <kai@tomerius.de> wrote:
>>> I should have mentioned that I'll have a large NAND flash, so ext4
>>> might still be the file system of choice. The other ones you mentioned
>>> are interesting to consider, but seem to be more fitting for a smaller
>>> NOR flash.
>>
>> If you mean raw NAND flash I would think UBIFS is still the way to go?
>> (It's been several years since I was into embedded Linux systems.)
>>
>> https://elinux.org/images/0/02/Filesystem_Considerations_for_Embedded_Devices.pdf
>> is focused on eMMC/SD Cards, which have built-in controllers that
>> enable them to present a block device interface, which is very unlike
>> what raw NAND devices have.
>>
>> Please see https://www.kernel.org/doc/html/latest/filesystems/ubifs.html
>> for more info.
>>
>
>You are right, for NAND there is an old (but gold) presentation here:
>
>https://elinux.org/images/7/7e/ELC2009-FlashFS-Toshiba.pdf
>
>UBIFS and YAFFS2 are the way to go.
>
>But please note that YAFFS2 needs license payment for commercial
>application (something that I only discovered recently when Xiaomi
>integrated it into NuttX mainline, bad surprise).
>
>BR,
>
>Alan


* Re: File system robustness
From: Theodore Ts'o @ 2023-07-18 21:32 UTC (permalink / raw)
  To: Alan C. Assis
  Cc: Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <CAG4Y6eTN1XbZ_jAdX+t2mkEN=KoNOqprrCqtX0BVfaH6AxkdtQ@mail.gmail.com>

On Tue, Jul 18, 2023 at 10:04:55AM -0300, Alan C. Assis wrote:
> 
> You are right, for NAND there is an old (but gold) presentation here:
> 
> https://elinux.org/images/7/7e/ELC2009-FlashFS-Toshiba.pdf
> 
> UBIFS and YAFFS2 are the way to go.

This presentation is specifically talking about flash devices that do
not have a flash translation layer (that is, they are using the MTD
interface).

There are multiple kinds of flash devices, which can be exported via
different interfaces: MTD, USB Storage, eMMC, UFS, SATA, SCSI, NVMe,
etc.  There are also differences in the sophistication of the
Flash Translation Layer: how powerful the microcontroller is, how much
memory and persistent storage for flash metadata is available to the
FTL, etc.

F2FS is a good choice for "low end flash", especially those flash
devices that use a very simplistic mapping between LBA (block/sector
numbers) and the physical flash to be used, and may have a very
limited number of flash blocks that can be open for modification at a
time.  For more sophisticated flash storage devices (e.g., SSDs and
higher end flash devices), this consideration won't matter, and then
the best file system to use will be very dependent on your workload.

In answer to Kai's original question, the setup that was described
should be fine --- assuming high quality hardware.  There are some
flash devices that are not designed to handle power failures correctly; which
is to say, if power is cut suddenly, the data used by the Flash
Translation Layer can be corrupted, in which case data written months
or years ago (not just recent data) could be lost.  There have been
horror stories about wedding photographers who dropped their camera,
and the SD Card came shooting out, and *all* of the data that was shot
on some couple's special day was completely *gone*.

Assuming that you have valid, power drop safe hardware, running fsck
after a power cut is not necessary, at least as far as file system
consistency is concerned.  If you have badly written userspace
application code, then all bets can be off.  For example, consider the
following sequence of events:

1)  An application like Tuxracer truncates the top-ten score file
2)  It then writes a new top-ten score file
3)  <Fail to call fsync, or write the file to a foo.new and then
       rename on top of the old version of the file>
4)  It then closes the Open GL library, triggering a bug in the cruddy
    proprietary binary-only kernel module video driver,
    leading to an immediate system crash.
5)  Complain to the file system developers that users' top-ten score
    file was lost, and what are the file system developers going to
    do about it?
6)  File system developers start creating T-shirts saying that what userspace
    applications really are asking for is a new open(2) flag, O_PONIES[1]

[1] https://blahg.josefsipek.net/?p=364
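
For contrast with step 3 above, here is a minimal sketch of the safe update pattern (write to a new file, fsync, then rename over the old version). It is written in Python for brevity; the function name atomic_write and the ".tmp-" prefix are illustrative assumptions, not something from this thread.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so that a crash leaves either the
    complete old file or the complete new file, never a truncated one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the file data to stable storage
        os.replace(tmp_path, path)  # atomic rename on top of the old file
    except BaseException:
        # Best-effort cleanup of the temporary file, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so that the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

An application would call something like atomic_write("top-ten.scores", new_scores) instead of truncating and rewriting the file in place. Note that mkstemp creates the file with restrictive permissions, so a real application may need to restore the original mode.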

So when you talk about overall system robustness, you need robust
hardware, you need a robust file system, you need to use the file
system correctly, and you need robust userspace applications.

If you get it all right, you'll be fine.  On the other hand, if you
have crappy hardware (such as might be found for cheap in the checkout
counter of the local Micro Center, or in a back alley vendor in
Shenzhen, China), or if you do something like misconfigure the file
system such as using the "nobarrier" mount option "to speed things
up", or if you have applications that update files in an unsafe
manner, then you will have problems.

Welcome to systems engineering.  :-)

						- Ted


* Re: File system robustness
From: Martin Steigerwald @ 2023-07-19  6:22 UTC (permalink / raw)
  To: Alan C. Assis, Theodore Ts'o
  Cc: Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <20230718213212.GE3842864@mit.edu>

Theodore Ts'o - 18.07.23, 23:32:12 CEST:
> If you get it all right, you'll be fine.  On the other hand, if you
> have crappy hardware (such as might be found for cheap in the checkout
> counter of the local Micro Center, or in a back alley vendor in
> Shenzhen, China), or if you do something like misconfigure the file
> system such as using the "nobarrier" mount option "to speed things
> up", or if you have applications that update files in an unsafe
> manner, then you will have problems.

Is "nobarrier" mount option still a thing? I thought those mount options 
have been deprecated or even removed with the introduction of cache flush 
handling in kernel 2.6.37?

Hmm, the mount option has been removed from XFS in kernel 4.19 
according to manpage, however no mention of any deprecation or removal 
in ext4 manpage. It also does not seem to be removed in BTRFS at least 
according to manpage btrfs(5).

-- 
Martin




* Re: File system robustness
From: Kai Tomerius @ 2023-07-19 10:51 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Alan C. Assis, Bjørn Forsman, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <20230718213212.GE3842864@mit.edu>

> In answer to Kai's original question, the setup that was described
> should be fine --- assuming high quality hardware.

I wonder how to judge that ... it's an eMMC supposedly complying to
some JEDEC standard, so it *should* be ok.

> ... if power is cut suddenly, the data used by the Flash
> Translation Layer can be corrupted, in which case data written months
> or years ago (not just recent data) could be lost.

At least I haven't observed anything like that up to now.

But on another aspect: how about the interaction between dm-integrity
and ext4? Sure, they each have their own journal, and they're
independent layers. Is there anything that could go wrong, say a block
that can't be recovered in the dm-integrity layer, causing ext4 to run
into trouble, e.g., an I/O error that prevents ext4 from mounting?

I assume the answer is "No", but can I be sure?

Thx
regards
Kai


* Re: File system robustness
From: Theodore Ts'o @ 2023-07-20  4:20 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Alan C. Assis, Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <4835096.GXAFRqVoOG@lichtvoll.de>

On Wed, Jul 19, 2023 at 08:22:43AM +0200, Martin Steigerwald wrote:
> 
> Is "nobarrier" mount option still a thing? I thought those mount options 
> have been deprecated or even removed with the introduction of cache flush 
> handling in kernel 2.6.37?

Yes, it's a thing, and if your server has a UPS with a reliable power
failure / low battery feedback, it's *possible* to engineer a reliable
system.  Or, for example, if you have a phone with an integrated
battery, so when you drop it the battery compartment won't open and
the battery won't go flying out, *and* the baseboard management
controller (BMC) will halt the CPU before the battery completely dies,
and gives a chance for the flash storage device to commit everything
before shutdown, *and* the BMC arranges to make sure the same thing
happens when the user pushes and holds the power button for 30
seconds, then it could be safe.

We also use nobarrier for scratch file systems which by definition
go away when the borg/kubernetes job dies, and which will *never*
survive a reboot, let alone a power failure.  In such a situation,
there's no point sending the cache flush, because the partition will
be mkfs'ed on reboot.  Or, if the iSCSI or Cloud Persistent Disk
will *always* go away when the VM dies, because any persistent state
is saved to some cluster or distributed file store (e.g., to the MySQL
server, or Big Table, or Spanner, etc.).  In these cases, you don't
*want* the Cache Flush operation, since skipping it reduces I/O
overhead.

So if you know what you are doing, in certain specialized use cases,
nobarrier can make sense, and it is used today at my $WORK's data
center for production jobs *all* the time.  So we won't be making
ext4's nobarrier mount option go away; it has users.  :-)
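
Since nobarrier is only safe in such specialized setups, it can help to audit where it is in use. The following is a rough sketch that scans text in /proc/mounts format for entries with barriers disabled. The helper name is hypothetical, and treating both "nobarrier" and "barrier=0" as the disabled spellings is an assumption about how a given file system reports the option.

```python
def mounts_without_barriers(mounts_text):
    """Return (mount point, fs type) pairs for entries whose option
    list contains "nobarrier" or "barrier=0".

    `mounts_text` is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        opts = set(options.split(","))
        if "nobarrier" in opts or "barrier=0" in opts:
            flagged.append((mount_point, fs_type))
    return flagged
```

On a running system this could be fed with the contents of /proc/mounts, e.g. mounts_without_barriers(open("/proc/mounts").read()).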

Cheers,

					- Ted


* Re: File system robustness
From: Theodore Ts'o @ 2023-07-20  4:41 UTC (permalink / raw)
  To: Kai Tomerius
  Cc: Alan C. Assis, Bjørn Forsman, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <20230719105138.GA19936@tomerius.de>

On Wed, Jul 19, 2023 at 12:51:39PM +0200, Kai Tomerius wrote:
> > In answer to Kai's original question, the setup that was described
> > should be fine --- assuming high quality hardware.
> 
> I wonder how to judge that ... it's an eMMC supposedly complying to
> some JEDEC standard, so it *should* be ok.

JEDEC promulgates the eMMC interface specification.  That's the
interface used to talk to the device, much like SATA and SCSI and
NVMe.  The JEDEC eMMC specification says nothing about the quality of
the implementation of the FTL, or whether it is safe from power drops,
or how many write cycles are supported before the eMMC soldered on the
$2000 MCU would expire.

If you're a cell phone manufacturer, the way you judge it is *before*
you buy a few million of the eMMC devices, you subject the samples to
a huge amount of power drops and other torture tests (including
verifying the claimed number of write cycles in spec sheet), before
the device is qualified for use in your product.

> But on another aspect: how about the interaction between dm-integrity
> and ext4? Sure, they each have their own journal, and they're
> independent layers. Is there anything that could go wrong, say a block
> that can't be recovered in the dm-integrity layer, causing ext4 to run
> into trouble, e.g., an I/O error that prevents ext4 from mounting?
> 
> I assume the answer is "No", but can I be sure?

If there are I/O errors, with or without dm-integrity, you can have
problems.  dm-integrity will turn bit-flips into hard I/O errors, whereas
without it a bit-flip might cause silent file system corruption (at least at
first), such that when you finally notice that there's a problem,
several days or weeks or months may have passed, and the data loss might
be far worse.  So turning an innocuous bit flip into a hard I/O error
can be a feature, assuming that you've allowed for it in your system
architecture.
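
As one hedged illustration of "allowing for it", an application can treat a hard read error as a signal to fall back to a redundant copy. This Python sketch takes two hypothetical reader callables so it stays self-contained. Which errno actually reaches userspace depends on the I/O path, so handling both EIO and EILSEQ here is an assumption, not a statement about dm-integrity's behaviour.

```python
import errno

# Error codes this sketch treats as "the block could not be trusted".
# This set is an assumption made for illustration.
INTEGRITY_ERRNOS = {errno.EIO, errno.EILSEQ}


def read_with_fallback(read_primary, read_backup):
    """Call read_primary(); if it fails with a hard I/O error, call
    read_backup() instead. Returns (data, source_name)."""
    try:
        return read_primary(), "primary"
    except OSError as exc:
        if exc.errno not in INTEGRITY_ERRNOS:
            raise  # unrelated errors (e.g. ENOENT) are not masked
    return read_backup(), "backup"
```

The design point is that the failure is detected at read time and handled explicitly, rather than corrupt data being consumed silently.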

If you assume that the hardware doesn't introduce I/O errors or bit
flips, and if you assume you don't have any attackers trying to
corrupt the block device with bit flips, then sure, nothing will go
wrong.  You can buy perfect hardware from the same supply store where
high school physics teachers buy frictionless pulleys and massless
ropes.  :-)

Cheers,

						- Ted


* Nobarrier mount option (was: Re: File system robustness)
From: Martin Steigerwald @ 2023-07-20  7:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Alan C. Assis, Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <20230720042034.GA5764@mit.edu>

Theodore Ts'o - 20.07.23, 06:20:34 CEST:
> On Wed, Jul 19, 2023 at 08:22:43AM +0200, Martin Steigerwald wrote:
> > Is "nobarrier" mount option still a thing? I thought those mount
> > options have been deprecated or even removed with the introduction
> > of cache flush handling in kernel 2.6.37?
> 
> Yes, it's a thing, and if your server has a UPS with a reliable power
> failure / low battery feedback, it's *possible* to engineer a reliable
> system.  Or, for example, if you have a phone with an integrated
> battery, so when you drop it the battery compartment won't open and
> the battery won't go flying out, *and* the baseboard management
> controller (BMC) will halt the CPU before the battery completely dies,
> and gives a chance for the flash storage device to commit everything
> before shutdown, *and* the BMC arranges to make sure the same thing
> happens when the user pushes and holds the power button for 30
> seconds, then it could be safe.

Thanks for clarification. I am aware that something like this can be 
done. But I did not think that it would be necessary to explicitly 
disable barriers, or should I more accurately write cache flushes, in 
such a case:

I thought that nowadays a cache flush would be (almost) a no-op in the 
case the storage receiving it is backed by such reliability measures. 
I.e. that the hardware just says "I am ready" when having the I/O 
request in stable storage whatever that would be, even in case that 
would be battery backed NVRAM and/or temporary flash.

At least that is what I thought was the background for not doing the 
"nobarrier" thing anymore: Let the storage below decide whether it is 
safe to basically ignore cache flushes by answering them (almost) 
immediately.

However, not sending the cache flushes in the first place would likely 
still be more efficient, although as far as I am aware the block layer has 
not returned success / failure information to the upper layers 
since kernel 2.6.37.

Seems I got to update my Linux Performance tuning slides about this once 
again.

> We also use nobarrier for scratch file systems which by definition
> go away when the borg/kubernetes job dies, and which will *never*
> survive a reboot, let alone a power failure.  In such a situation,
> there's no point sending the cache flush, because the partition will
> be mkfs'ed on reboot.  Or, if the iSCSI or Cloud Persistent Disk
> will *always* go away when the VM dies, because any persistent state
> is saved to some cluster or distributed file store (e.g., to the MySQL
> server, or Big Table, or Spanner, etc.).  In these cases, you don't
> *want* the Cache Flush operation, since skipping it reduces I/O
> overhead.

Hmm, right.

> So if you know what you are doing, in certain specialized use cases,
> nobarrier can make sense, and it is used today at my $WORK's data
> center for production jobs *all* the time.  So we won't be making
> ext4's nobarrier mount option go away; it has users.  :-)

I now wonder why XFS people deprecated and even removed those mount 
options. But maybe I better ask them separately instead of adding their 
list in CC. Probably by forwarding this mail to the XFS mailing list 
later on.

Best,
-- 
Martin




* Re: Nobarrier mount option (was: Re: File system robustness)
From: Theodore Ts'o @ 2023-07-21 13:35 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Alan C. Assis, Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <38426448.10thIPus4b@lichtvoll.de>

On Thu, Jul 20, 2023 at 09:55:22AM +0200, Martin Steigerwald wrote:
> 
> I thought that nowadays a cache flush would be (almost) a no-op in the 
> case the storage receiving it is backed by such reliability measures. 
> I.e. that the hardware just says "I am ready" when having the I/O 
> request in stable storage whatever that would be, even in case that 
> would be battery backed NVRAM and/or temporary flash.

That *can* be true if the storage subsystem has the reliability
measures.  For example, if you have a $$$ EMC storage array, then sure, it
has an internal UPS backup and it will know that it can ignore that
CACHE FLUSH request.

However, if you are *building* a storage system, the storage device
might be an HDD which has no idea that it doesn't need to worry
about power drops.  Consider if you will, a rack of servers, each with
a dozen or more HDD's.  There is a rack-level battery backup, and the
rack is located in a data center with diesel generators with enough
fuel supply to keep the entire data center, plus cooling, going for
days.  The rack of servers is part of a cluster file system.  So when
a file write to a cluster file system is performed, the cluster file
system will pick three servers, each in a different rack, and each
rack is in a different power distribution domain.  That way, even if the
entry-level switch on the rack dies, or the Power Distribution Unit
(PDU) servicing a group of racks blows up, the data will be available
on the other two servers.
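
The placement rule described above (three copies, each in a different rack and a different power distribution domain) can be sketched roughly as follows. This is a simplified greedy illustration in Python with made-up server, rack, and power domain names. It is not how any particular cluster file system implements placement, and a greedy pass can miss a valid combination that a full search would find.

```python
def pick_replicas(servers, copies=3):
    """Pick `copies` servers such that no two share a rack or a power
    distribution domain.

    `servers` is a list of (name, rack, power_domain) tuples, considered
    in order. Raises ValueError if the greedy pass finds too few.
    """
    chosen = []
    used_racks = set()
    used_domains = set()
    for name, rack, domain in servers:
        if rack in used_racks or domain in used_domains:
            continue  # would share a failure domain with an earlier pick
        chosen.append(name)
        used_racks.add(rack)
        used_domains.add(domain)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough independent failure domains")
```

With such a layout, losing one rack switch or one PDU takes out at most one of the three copies, which is the property the cache flush cannot know about.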

> At least that is what I thought was the background for not doing the 
> "nobarrier" thing anymore: Let the storage below decide whether it is 
> safe to basically ignore cache flushes by answering them (almost) 
> immediately.

The problem is that the storage below (e.g., the HDD) has no idea that
all of this redundancy exists.  Only the system administrator who is
configuring the file system will know.  And if you are running a
hyper-scale cloud system, this kind of custom made system will be
much, MUCH, cheaper than buying a huge number of $$$ EMC storage
arrays.

Cheers,

					- Ted


* Re: Nobarrier mount option (was: Re: File system robustness)
From: Martin Steigerwald @ 2023-07-21 14:51 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Alan C. Assis, Bjørn Forsman, Kai Tomerius, linux-embedded,
	Ext4 Developers List, dm-devel
In-Reply-To: <20230721133526.GF5764@mit.edu>

Theodore Ts'o - 21.07.23, 15:35:26 CEST:
> > At least that is what I thought was the background for not doing the
> > "nobarrier" thing anymore: Let the storage below decide whether it
> > is safe to basically ignore cache flushes by answering them (almost)
> > immediately.
> 
> The problem is that the storage below (e.g., the HDD) has no idea that
> all of this redundancy exists.  Only the system administrator who is
> configuring the file system will know.  And if you are running a
> hyper-scale cloud system, this kind of custom made system will be
> much, MUCH, cheaper than buying a huge number of $$$ EMC storage
> arrays.

Okay, that is reasonable.

Thanks for explaining.

-- 
Martin




* PSA: migrating linux-embedded to new vger infrastructure
From: Konstantin Ryabitsev @ 2023-11-06 13:10 UTC (permalink / raw)
  To: linux-embedded

Good day!

I plan to migrate the linux-embedded@vger.kernel.org list to the new
infrastructure this week. We're still doing it list-by-list to make sure that
we don't run into scaling issues with the new infra.

The migration will be performed live and should not require any downtime.
There will be no changes to how anyone interacts with the list after
migration is completed, so no action is required on anyone's part.

Please let me know if you have any concerns.

Best wishes,
-K


* PSA: This list is being migrated (no action required)
From: Konstantin Ryabitsev @ 2023-11-10 18:51 UTC (permalink / raw)
  To: linux-embedded, linux-ext4, linux-fbdev, linux-fpga,
	linux-fscrypt, linux-gcc, linux-gpio, linux-hams, linux-hexagon,
	linux-hotplug, linux-hwmon, linux-i2c, linux-ia64, linux-ide,
	linux-iio, linux-input, linux-integrity, linux-kbuild,
	linux-kselftest, linux-leds, linux-m68k, linux-man, linux-media,
	linux-mips, linux-mmc, linux-msdos

Hello, all:

This list is being migrated to new vger infrastructure. No action is required
on your part and there will be no change in how you interact with this list
after the migration is completed.

There will be a short 30-minute delay to the list archives on lore.kernel.org.
Once the backend work is done, I will follow up with another message.

-K



* Re: PSA: This list is being migrated (no action required)
From: Konstantin Ryabitsev @ 2023-11-10 19:35 UTC (permalink / raw)
  To: linux-embedded, linux-ext4, linux-fbdev, linux-fpga,
	linux-fscrypt, linux-gcc, linux-gpio, linux-hams, linux-hexagon,
	linux-hotplug, linux-hwmon, linux-i2c, linux-ia64, linux-ide,
	linux-iio, linux-input, linux-integrity, linux-kbuild,
	linux-kselftest, linux-leds, linux-m68k, linux-man, linux-media,
	linux-mips, linux-mmc, linux-msdos
In-Reply-To: <cfriwrxovqzcrptf74ccq52lcqj2nsergucufsz6wlh45fdnz3@z5e5y2lowbq2>

On Fri, Nov 10, 2023 at 01:51:44PM -0500, Konstantin Ryabitsev wrote:
> This list is being migrated to new vger infrastructure. No action is required
> on your part and there will be no change in how you interact with this list
> after the migration is completed.
> 
> There will be a short 30-minute delay to the list archives on lore.kernel.org.
> Once the backend work is done, I will follow up with another message.

This work is completed now. This message acts as a test to make sure archives
are working at their new place.

If anything is not working or looking right, please reach out to
helpdesk@kernel.org.

-K


* Free 60-day trial: Improve your manufacturing processes
From: Michal Rmoutil @ 2023-12-04  8:50 UTC (permalink / raw)
  To: linux-embedded

Good morning

Do you know of a system that not only monitors production but also optimizes it and brings in steady income?

Using the latest technologies and data analysis, our solution identifies areas for optimization, efficiency gains, and cost reduction. Our clients have seen revenue increase by 20% on average, and today you can try it free for 60 days.

If you would like more details, please reply to the contact number.


Regards
Michal Rmoutil


* Debugging early SError exception
From: Lior Weintraub @ 2023-12-17 21:32 UTC (permalink / raw)
  To: linux-embedded@vger.kernel.org

Hi,

We have a new SoC with an eLinux port (kernel v6.5).
This SoC is an ARM64 (A53) single-core device.
It runs correctly on QEMU but fails with SError on emulation platform (Synopsys Zebu running our SoC model).
There is no debugger connected to this emulation but there are several debug capabilities we can use:
1. Generating wave dump of CPU signals
2. Generate a Tarmac log
3. UART

Since the SError happens at early stages of Linux boot the UART is not enabled yet.
From the Tarmac log we can see:
 3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:	ret 	(parse_early_param)
 3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:	mov	x0,	#0xc0	//	#192 	(setup_arch)
                    R X0 (AARCH64) 00000000 000000c0
 3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:	msr	daif,	x0 	(setup_arch)
                    R CPSR 600000c5
 3824884529 ps  ES  System Error (Abort)
                    EXC [0x380] SError/vSError Current EL with SP_ELx
                    R ESR_EL1 (AARCH64) bf000002
                    R CPSR 600003c5
                    R SPSR_EL1 (AARCH64) 600000c5
                    R ELR_EL1 (AARCH64) ffff8000 80763a68
 3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:	sub	sp,	sp,	#0x150 	(vectors)
                    R SP_EL1 (AARCH64) ffff8000 808f3c50
 3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:	add	sp,	sp,	x0 	(vectors)
                    R SP_EL1 (AARCH64) ffff8000 808f3d10
 3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:	sub	x0,	sp,	x0 	(vectors)
                    R X0 (AARCH64) ffff8000 808f3c50
 3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:	tbnz	w0,	#14,	ffff800080010b9c	<vectors+0x39c> 	(vectors)
 3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:	sub	x0,	sp,	x0 	(vectors)
                    R X0 (AARCH64) 00000000 000000c0
 3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:	sub	sp,	sp,	x0 	(vectors)
                    R SP_EL1 (AARCH64) ffff8000 808f3c50
 3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:	b	ffff800080011354	<el1h_64_error> 	(vectors)

If I understand correctly, the exception happened sometime earlier, and only now has the Linux boot code (setup_arch) unmasked asynchronous exceptions, so we immediately jump to the SError exception handler.
From the Linux source:
	parse_early_param();

	dynamic_scs_init();

	/*
	 * Unmask asynchronous aborts and fiq after bringing up possible
	 * earlycon. (Report possible System Errors once we can report this
	 * occurred).
	 */
	local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
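
To make the mask change concrete, here is a small Python sketch that decodes the D, A, I and F mask bits (PSTATE bits 9 to 6) from the values in the Tarmac log. The helper name is made up for illustration.

```python
# AArch64 PSTATE/SPSR exception mask bits.
DAIF_BITS = {
    "D": 1 << 9,  # debug exceptions
    "A": 1 << 8,  # SError (asynchronous abort)
    "I": 1 << 7,  # IRQ
    "F": 1 << 6,  # FIQ
}


def masked_exceptions(pstate):
    """Return the names of the exception classes masked in `pstate`."""
    return [name for name, bit in DAIF_BITS.items() if pstate & bit]


print(masked_exceptions(0xC0))        # value written by "msr daif, x0"
print(masked_exceptions(0x600000C5))  # CPSR after the write
print(masked_exceptions(0x600003C5))  # CPSR on exception entry
```

The first two calls return ['I', 'F'], so the write of 0xc0 leaves SError (A) unmasked, which is the point where a pending SError can be taken. The last call returns ['D', 'A', 'I', 'F'], matching the fully masked state on exception entry.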

After some kernel hacking (replacing printk) we could extract the logs:
6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
6Machine model: Pliops Spider MK-I EVK
2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
Hardware name: Pliops Spider MK-I EVK (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : setup_arch+0x13c/0x5ac
lr : setup_arch+0x134/0x5ac
sp : ffff8000808f3da0
x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
0Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
Hardware name: Pliops Spider MK-I EVK (DT)
Call trace:
 dump_backtrace+0x9c/0xd0
 show_stack+0x14/0x1c
 dump_stack_lvl+0x44/0x58
 dump_stack+0x14/0x1c
 panic+0x2e0/0x33c
 nmi_panic+0x68/0x6c
 arm64_serror_panic+0x68/0x78
 do_serror+0x24/0x54
 el1h_64_error_handler+0x2c/0x40
 el1h_64_error+0x64/0x68
 setup_arch+0x13c/0x5ac
 start_kernel+0x5c/0x5b8
 __primary_switched+0xb4/0xbc
0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
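
As a starting point for reading the syndrome value in the log, this Python sketch splits ESR_EL1 into its architectural fields (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and, for an SError, extracts the IDS bit (ISS bit 24). The function name is illustrative. The sketch deliberately does not interpret the implementation defined part of the ISS.

```python
EC_SERROR = 0x2F  # exception class for an SError interrupt


def decode_esr(esr):
    """Split an ESR_ELx value into its top-level fields."""
    fields = {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # instruction length bit
        "ISS": esr & 0x1FFFFFF,     # instruction specific syndrome
    }
    if fields["EC"] == EC_SERROR:
        # IDS = 1 means the rest of the ISS is implementation defined.
        fields["IDS"] = (fields["ISS"] >> 24) & 0x1
    return fields


for name, value in decode_esr(0xBF000002).items():
    print(f"{name} = {value:#x}")
```

For 0xbf000002 this gives EC = 0x2f (SError), IL = 1, ISS = 0x1000002 and IDS = 1. Because IDS is set, the low ISS bits have to be looked up in the Technical Reference Manual of the core rather than in the architecture manual.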

Can you please advise how to proceed with debugging?

Thanks in advance,
Cheers,
Lior.




* Re: Debugging early SError exception
From: Dirk Behme @ 2023-12-19  7:09 UTC (permalink / raw)
  To: Lior Weintraub, linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB05556D6B225E93A2B5BFFE88C391A@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> Hi,
> 
> We have a new SoC with eLinux porting (kernel v6.5).
> This SoC is ARM64 (A53) single core based device.
> It runs correctly on QEMU but fails with SError on emulation platform (Synopsys Zebu running our SoC model).
> There is no debugger connected to this emulation but there are several debug capabilities we can use:
> 1. Generating wave dump of CPU signals
> 2. Generate a Tarmac log
> 3. UART
> 
> Since the SError happens at early stages of Linux boot the UART is not enabled yet.
>  From the Tarmac log we can see:
>   3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:	ret 	(parse_early_param)
>   3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:	mov	x0,	#0xc0	//	#192 	(setup_arch)
>                      R X0 (AARCH64) 00000000 000000c0
>   3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:	msr	daif,	x0 	(setup_arch)
>                      R CPSR 600000c5
>   3824884529 ps  ES  System Error (Abort)
>                      EXC [0x380] SError/vSError Current EL with SP_ELx
>                      R ESR_EL1 (AARCH64) bf000002
>                      R CPSR 600003c5
>                      R SPSR_EL1 (AARCH64) 600000c5
>                      R ELR_EL1 (AARCH64) ffff8000 80763a68
>   3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:	sub	sp,	sp,	#0x150 	(vectors)
>                      R SP_EL1 (AARCH64) ffff8000 808f3c50
>   3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:	add	sp,	sp,	x0 	(vectors)
>                      R SP_EL1 (AARCH64) ffff8000 808f3d10
>   3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:	sub	x0,	sp,	x0 	(vectors)
>                      R X0 (AARCH64) ffff8000 808f3c50
>   3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:	tbnz	w0,	#14,	ffff800080010b9c	<vectors+0x39c> 	(vectors)
>   3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:	sub	x0,	sp,	x0 	(vectors)
>                      R X0 (AARCH64) 00000000 000000c0
>   3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:	sub	sp,	sp,	x0 	(vectors)
>                      R SP_EL1 (AARCH64) ffff8000 808f3c50
>   3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:	b	ffff800080011354	<el1h_64_error> 	(vectors)
> 
> If I understand correctly, the SError became pending some time earlier, and only now does the Linux boot code (setup_arch) unmask it, so we immediately jump to the SError exception handler.
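
The values in the Tarmac log can be decoded by hand to check this
reading. The following is a minimal sketch in plain C, runnable on a
host; the helper names are made up for illustration, and only the
architectural fields are decoded (DAIF mask bits, and ESR_ELx EC, IL
and ISS). The low ISS bits of an implementation-defined SError
syndrome are core specific and are left undecoded here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Exception mask bits as laid out in the DAIF register and in SPSR. */
#define PSR_F_BIT (1u << 6) /* FIQ masked */
#define PSR_I_BIT (1u << 7) /* IRQ masked */
#define PSR_A_BIT (1u << 8) /* SError (asynchronous abort) masked */
#define PSR_D_BIT (1u << 9) /* debug exceptions masked */

#define ESR_EC_SERROR  0x2fu       /* exception class: SError interrupt */
#define ESR_SERROR_IDS (1u << 24)  /* ISS is implementation defined */

/* ESR_ELx layout: EC in bits [31:26], IL in bit [25], ISS in bits [24:0]. */
static uint32_t esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3fu; }
static uint32_t esr_il(uint32_t esr)  { return (esr >> 25) & 0x1u; }
static uint32_t esr_iss(uint32_t esr) { return esr & 0x1ffffffu; }

/* True if a DAIF (or PSTATE) value keeps SError masked. */
static bool serror_masked(uint32_t daif)
{
	return (daif & PSR_A_BIT) != 0;
}

/* True if the syndrome is an SError with implementation-defined ISS. */
static bool esr_is_impdef_serror(uint32_t esr)
{
	return esr_ec(esr) == ESR_EC_SERROR && (esr & ESR_SERROR_IDS) != 0;
}

/* Hypothetical helper: print a one-glance summary of both values. */
static void print_serror_summary(uint32_t daif_written, uint32_t esr)
{
	assert(esr_ec(esr) <= 0x3fu);
	printf("DAIF write 0x%x: SError %s\n", (unsigned)daif_written,
	       serror_masked(daif_written) ? "masked" : "unmasked");
	printf("ESR 0x%08x: EC=0x%02x IL=%u ISS=0x%07x%s\n",
	       (unsigned)esr, (unsigned)esr_ec(esr), (unsigned)esr_il(esr),
	       (unsigned)esr_iss(esr),
	       esr_is_impdef_serror(esr) ? " (impl. defined SError)" : "");
}
```

Feeding in the logged values (0xc0 written to DAIF, ESR_EL1 =
0xbf000002) shows that the write leaves only IRQ and FIQ masked, so a
pending SError is taken right after the msr instruction, and that the
syndrome is an SError with an implementation-defined ISS.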


Yes, that sounds reasonable. If I understood correctly, you are 
running something "quite new" on a software simulator (QEMU) and on a 
hardware emulator (Synopsys).

That would mean that you have new hardware with, e.g., a new memory 
map that has not been used before. What you describe sounds as if 
something in the code that runs before Linux (the boot loader) 
triggers the SError. This might be an access to non-existing or not 
yet enabled hardware, i.e. a read or write to an address that is not 
available yet (or is simply invalid). That is hard to debug. If you 
are able to modify the code that runs before Linux (the boot loader?), 
you might try to unmask SError exceptions there, too, so that the 
exception is taken earlier and the search window gets smaller. I'm not 
that familiar with QEMU, but could you try to trace which (all?) 
hardware accesses your code performs, and then check whether each of 
these accesses is also valid on the hardware (Synopsys) emulation 
system? Check both that the address is valid and that the hardware 
subsystem behind it is enabled.
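
One way to do the check described above is to replay a trace of
accesses against a table describing the target's memory map. The
following is only a sketch in plain C: the structure, the names and
the idea of an "enabled" flag per region are assumptions made for
illustration, not something taken from QEMU or from the SoC in
question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical target memory map. */
struct mmio_region {
	const char *name;
	uint64_t base;
	uint64_t size;
	bool enabled; /* clock/reset/power released at this boot stage? */
};

enum access_verdict {
	ACCESS_OK,       /* address is mapped and the block is enabled */
	ACCESS_DISABLED, /* address is mapped, but the block is not enabled */
	ACCESS_UNMAPPED  /* no region covers the whole access */
};

/* Classify one traced access of 'len' bytes starting at 'addr'. */
static enum access_verdict check_access(const struct mmio_region *map,
					size_t n, uint64_t addr, uint64_t len)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct mmio_region *r = &map[i];

		/* Overflow-safe test for [addr, addr+len) inside the region. */
		if (addr >= r->base && len <= r->size &&
		    addr - r->base <= r->size - len)
			return r->enabled ? ACCESS_OK : ACCESS_DISABLED;
	}
	return ACCESS_UNMAPPED;
}
```

Every access that comes back as ACCESS_DISABLED or ACCESS_UNMAPPED is
a candidate for the SError source on the emulation platform, even if
QEMU tolerates it.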

Hth,

Dirk


>  From the Linux source:
> 	parse_early_param();
> 
> 	dynamic_scs_init();
> 
> 	/*
> 	 * Unmask asynchronous aborts and fiq after bringing up possible
> 	 * earlycon. (Report possible System Errors once we can report this
> 	 * occurred).
> 	 */
> 	local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
> 
> After some kernel hacking (replacing printk) we could extract the logs:
> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> 6Machine model: Pliops Spider MK-I EVK
> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> Hardware name: Pliops Spider MK-I EVK (DT)
> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : setup_arch+0x13c/0x5ac
> lr : setup_arch+0x134/0x5ac
> sp : ffff8000808f3da0
> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> Hardware name: Pliops Spider MK-I EVK (DT)
> Call trace:
>   dump_backtrace+0x9c/0xd0
>   show_stack+0x14/0x1c
>   dump_stack_lvl+0x44/0x58
>   dump_stack+0x14/0x1c
>   panic+0x2e0/0x33c
>   nmi_panic+0x68/0x6c
>   arm64_serror_panic+0x68/0x78
>   do_serror+0x24/0x54
>   el1h_64_error_handler+0x2c/0x40
>   el1h_64_error+0x64/0x68
>   setup_arch+0x13c/0x5ac
>   start_kernel+0x5c/0x5b8
>   __primary_switched+0xb4/0xbc
> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> 
> Can you please advise how to proceed with debugging?
> 
> Thanks in advance,
> Cheers,
> Lior.
> 
> 


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-19 13:23 UTC (permalink / raw)
  To: Dirk Behme, linux-embedded@vger.kernel.org
In-Reply-To: <b17b9901-5007-4d12-99d9-be531360227e@gmail.com>

Thanks Dirk,
Cheers,
Lior.



^ permalink raw reply

* Re: Debugging early SError exception
From: Dirk Behme @ 2023-12-19 13:37 UTC (permalink / raw)
  To: Lior Weintraub, linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB05550310B202221101564038C397A@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Am 19.12.23 um 14:23 schrieb Lior Weintraub:
> Thanks Dirk,

Welcome :)

In case you find the root cause it would be nice to get some generic 
description of it so that we can learn something :)

Best regards

Dirk




^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-21  7:43 UTC (permalink / raw)
  To: Dirk Behme, linux-embedded@vger.kernel.org
In-Reply-To: <375eeb75-dde5-4806-a2d7-7f4e97342ee8@gmail.com>

Hi Dirk,

We found that the issue was in the early stages of barebox (a.k.a. U-Boot v2).
Our implementation of putc_ll (for debug_ll) wrote into the UART Tx FIFO without checking whether the FIFO was full.
Once the FIFO filled up, this caused the SError, probably because the UART IP raised an apberror signal.

Linux now boots without reporting the SError, but we face another issue.
We see that the PC ends up in the "report_bug" function.
Linux doesn't print anything to the UART (probably because it hasn't reached the point where the console is configured?).
Since our debug means are limited, it may take some time to find the root cause.

I will keep you posted on our findings.
I would love to hear your thoughts.

Cheers,
Lior.
 



^ permalink raw reply

* Re: Debugging early SError exception
From: Dirk Behme @ 2023-12-21  8:29 UTC (permalink / raw)
  To: Lior Weintraub, linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB05553792B68C8A71BC08E4A5C395A@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> Hi Dirk,
> 
> We found that the issue was in the early stages of barebox (a.k.a. U-Boot v2).

Glad to hear that! :)

> Our implementation of putc_ll (for debug_ll) wrote into the UART Tx FIFO without checking whether the FIFO was full.
> Once the FIFO filled up, this caused the SError, probably because the UART IP raised an apberror signal.

Thanks for the report!

> Linux now boots without reporting the SError, but we face another issue.
> We see that the PC ends up in the "report_bug" function.
> Linux doesn't print anything to the UART (probably because it hasn't reached the point where the console is configured?).

For cases like this, an early console is usually a good option 
(earlycon on arm64; earlyprintk on some other architectures). Check 
whether the Linux kernel serial console (UART) driver of your SoC 
supports it. In the end it should be "just" a function in the serial 
console driver that outputs the console data by polling, before the 
interrupt-driven console part takes over (later).
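
To make the "just a function" part concrete, here is a host-runnable
sketch of such a polled write routine. It is not kernel code: all
names are invented for this example, and the capture sink stands in
for a real polled putc. In the kernel, the serial core helper
uart_console_write() plays a similar role for early consoles.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that emits one character by polling (hypothetical type). */
typedef void (*emit_fn)(void *ctx, char c);

/*
 * Write n characters of s through emit, turning "\n" into "\r\n" the
 * way serial consoles usually expect.
 */
static void early_console_write(void *ctx, const char *s, size_t n,
				emit_fn emit)
{
	assert(s != NULL && emit != NULL);

	for (size_t i = 0; i < n; i++) {
		if (s[i] == '\n')
			emit(ctx, '\r');
		emit(ctx, s[i]);
	}
}

/* Test sink: collects the output in memory instead of a UART. */
struct capture {
	char buf[64];
	size_t len;
};

static void capture_emit(void *ctx, char c)
{
	struct capture *cap = ctx;

	/* Keep one byte free so the buffer stays NUL terminated. */
	if (cap->len < sizeof(cap->buf) - 1)
		cap->buf[cap->len++] = c;
	cap->buf[cap->len] = '\0';
}
```

On a target, the emit callback would be the driver's polled putc, i.e.
a routine that waits for room in the Tx FIFO before writing each
character.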

Best regards

Dirk


> Since our debug means are limited, it may take some time to find the root cause.
> 
> I will keep you posted on our findings.
> I would love to hear your thoughts.
> 
> Cheers,
> Lior.
>   
> 
>> -----Original Message-----
>> From: Dirk Behme <dirk.behme@gmail.com>
>> Sent: Tuesday, December 19, 2023 3:37 PM
>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>> Subject: Re: Debugging early SError exception
>>
>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
>>> Thanks Dirk,
>>
>> Welcome :)
>>
>> In case you find the root cause it would be nice to get some generic
>> description of it so that we can learn something :)
>>
>> Best regards
>>
>> Dirk
>>
>>
>>>> -----Original Message-----
>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>> Sent: Tuesday, December 19, 2023 9:09 AM
>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>> Subject: Re: Debugging early SError exception
>>>>
>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
>>>>> Hi,
>>>>>
>>>>> We have a new SoC with eLinux porting (kernel v6.5).
>>>>> This SoC is ARM64 (A53) single core based device.
>>>>> It runs correctly on QEMU but fails with SError on emulation platform (Synopsys Zebu running our SoC model).
>>>>> There is no debugger connected to this emulation but there are several debug capabilities we can use:
>>>>> 1. Generating wave dump of CPU signals
>>>>> 2. Generate a Tarmac log
>>>>> 3. UART
>>>>>
>>>>> Since the SError happens at early stages of Linux boot the UART is not enabled yet.
>>>>>    From the Tarmac log we can see:
>>>>>     3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret     (parse_early_param)
>>>>>     3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov     x0, #0xc0   //      #192    (setup_arch)
>>>>>                        R X0 (AARCH64) 00000000 000000c0
>>>>>     3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr     daif,   x0      (setup_arch)
>>>>>                        R CPSR 600000c5
>>>>>     3824884529 ps  ES  System Error (Abort)
>>>>>                        EXC [0x380] SError/vSError Current EL with SP_ELx
>>>>>                        R ESR_EL1 (AARCH64) bf000002
>>>>>                        R CPSR 600003c5
>>>>>                        R SPSR_EL1 (AARCH64) 600000c5
>>>>>                        R ELR_EL1 (AARCH64) ffff8000 80763a68
>>>>>     3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub     sp, sp,     #0x150  (vectors)
>>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>     3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add     sp, sp,     x0      (vectors)
>>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3d10
>>>>>     3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
>>>>>                        R X0 (AARCH64) ffff8000 808f3c50
>>>>>     3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz    w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
>>>>>     3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
>>>>>                        R X0 (AARCH64) 00000000 000000c0
>>>>>     3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub     sp, sp,     x0      (vectors)
>>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>     3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b       ffff800080011354        <el1h_64_error>         (vectors)
>>>>>
>>>>> If I understand correctly, the exception happened sometime earlier and only now Linux boot code (setup_arch) opened the exception handling and as a result we immediately jump to the SError exception handler.
>>>>
>>>>
>>>> Yes, that sounds reasonable. If I understood correctly, you are
>>>> running something "quite new" on some software (QEMU) and hardware
>>>> (Synopsis) simulators.
>>>>
>>>> That would mean that you have new hardware with e.g. new memory map
>>>> not used before. What you describe might sound like in the code before
>>>> Linux (boot loader) there is anything resulting in the SError. This
>>>> might be an access to non-existing or non-enabled hardware. I.e. it
>>>> might be that you try to access (read/write) an address what is not
>>>> available, yet (or just invalid). It's hard to debug that. In case you
>>>> are able to modify the code before Linux (the boot loader?) you might
>>>> try to enable SError exceptions, there, too. To get it earlier and
>>>> with that make the search window smaller. I'm not that familiar with
>>>> QEMU, but could you try to trace which (all?) hardware accesses your
>>>> code does. And with that analyse all accesses and with that check if
>>>> all these accesses are valid even on the hardware (Synopsis) emulation
>>>> system? That should be checked from valid address and from hardware
>>>> subsystem enablement point of view.
>>>>
>>>> Hth,
>>>>
>>>> Dirk
>>>>
>>>>
>>>>>    From the Linux source:
>>>>>         parse_early_param();
>>>>>
>>>>>         dynamic_scs_init();
>>>>>
>>>>>         /*
>>>>>          * Unmask asynchronous aborts and fiq after bringing up possible
>>>>>          * earlycon. (Report possible System Errors once we can report this
>>>>>          * occurred).
>>>>>          */
>>>>>         local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
>>>>>
>>>>> After some kernel hacking (replacing printk) we could extract the logs:
>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
>>>>> 6Machine model: Pliops Spider MK-I EVK
>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> pc : setup_arch+0x13c/0x5ac
>>>>> lr : setup_arch+0x134/0x5ac
>>>>> sp : ffff8000808f3da0
>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>> Call trace:
>>>>>     dump_backtrace+0x9c/0xd0
>>>>>     show_stack+0x14/0x1c
>>>>>     dump_stack_lvl+0x44/0x58
>>>>>     dump_stack+0x14/0x1c
>>>>>     panic+0x2e0/0x33c
>>>>>     nmi_panic+0x68/0x6c
>>>>>     arm64_serror_panic+0x68/0x78
>>>>>     do_serror+0x24/0x54
>>>>>     el1h_64_error_handler+0x2c/0x40
>>>>>     el1h_64_error+0x64/0x68
>>>>>     setup_arch+0x13c/0x5ac
>>>>>     start_kernel+0x5c/0x5b8
>>>>>     __primary_switched+0xb4/0xbc
>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>>>>>
>>>>> Can you please advice how to proceed with debugging?
>>>>>
>>>>> Thanks in advanced,
>>>>> Cheers,
>>>>> Lior.
>>>>>
>>>>>
>>>>
>>>
> 


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-21 10:04 UTC (permalink / raw)
  To: Dirk Behme, linux-embedded@vger.kernel.org
In-Reply-To: <b139c136-0417-4ac5-b99a-ba999d7418a0@gmail.com>

Thanks Dirk,

Regarding earlyprintk, I'm not sure how to make it work.
I have set CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y in my config but it doesn't seem to work.
Do I need to pass something in the bootargs from U-Boot?
Do I need to add something to my device tree?
(I tried setting bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" in my DT but it didn't work.)

The UART I am using is "snps,dw-apb-uart".

Last week, to get at the early logs, I implemented this hack:
1. Modify the printk macro to call my print_func.
2. print_func writes each character into a single global variable (u32 simul_uart;).
3. Get the address of this global variable and extract all writes to it from the Tarmac logs.

This is a very slow and tedious process but it helped me identify the initial SError.
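
For reference, steps 1 and 2 boil down to roughly the following. This is a simplified, self-contained sketch rather than the actual kernel patch; the return value and the NULL check are only there for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Single global that is watched in the Tarmac log. It is volatile so
 * that the compiler emits one store per character instead of merging
 * or dropping them.
 */
volatile uint32_t simul_uart;

/* Write a string one character at a time; returns the number of stores. */
size_t print_func(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0') {
        simul_uart = (uint32_t)(unsigned char)s[n];
        n++;
    }
    return n;
}
```

Every store to the address of simul_uart in the trace is then one output character, in order.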
Initially I thought I could write directly to the UART FIFO register (whose physical address I know), but this didn't work because Linux has already set up the MMU, so I guess I need the virtual address of this FIFO.
Do I need to use __phys_to_virt or something similar?

Cheers,
Lior.

> -----Original Message-----
> From: Dirk Behme <dirk.behme@gmail.com>
> Sent: Thursday, December 21, 2023 10:30 AM
> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> Subject: Re: Debugging early SError exception
> 
> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> > Hi Dirk,
> >
> > We found that the issue was at the early stages of Barebox (a.k.a U-BOOT
> v2).
> 
> Glad to hear that! :)
> 
> > Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
> FIFO without checking if the FIFO is full.
> > Once the fifo got full it caused this SError probably because the UART IP
> generated an apberror signal.
> 
> Thanks for the report!
> 
> > Now the Linux is running and doesn't report the SError again but now we
> face another issue.
> > We see that the PC is getting into a "report_bug" function.
> > The Linux doesn't print anything to the UART (probably since it hasn't got to
> the point where the console is configured?).
> 
> For cases like this using earlyprintk is usually a good option. Check
> the Linux kernel serial console (UART) dirver of you SoC if it
> supports it. In the end it should be "just" a function in the serial
> console driver which outputs the console data via polling before
> (later) the interrupt driven console part takes over.
> 
> Best regards
> 
> Dirk
> 
> 
> > Since our debug means are limited it can take some time to find the root
> cause.
> >
> > I will keep you posted and update our findings.
> > Love to hear your thoughts,
> >
> > Cheers,
> > Lior.
> >
> >
> >> -----Original Message-----
> >> From: Dirk Behme <dirk.behme@gmail.com>
> >> Sent: Tuesday, December 19, 2023 3:37 PM
> >> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >> Subject: Re: Debugging early SError exception
> >>
> >> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
> >>> Thanks Dirk,
> >>
> >> Welcome :)
> >>
> >> In case you find the root cause it would be nice to get some generic
> >> description of it so that we can learn something :)
> >>
> >> Best regards
> >>
> >> Dirk
> >>
> >>
> >>>> -----Original Message-----
> >>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>> Sent: Tuesday, December 19, 2023 9:09 AM
> >>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> embedded@vger.kernel.org
> >>>> Subject: Re: Debugging early SError exception
> >>>>
> >>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> >>>>> Hi,
> >>>>>
> >>>>> We have a new SoC with eLinux porting (kernel v6.5).
> >>>>> This SoC is ARM64 (A53) single core based device.
> >>>>> It runs correctly on QEMU but fails with SError on emulation platform
> >>>> (Synopsys Zebu running our SoC model).
> >>>>> There is no debugger connected to this emulation but there are several
> >>>> debug capabilities we can use:
> >>>>> 1. Generating wave dump of CPU signals
> >>>>> 2. Generate a Tarmac log
> >>>>> 3. UART
> >>>>>
> >>>>> Since the SError happens at early stages of Linux boot the UART is not
> >>>> enabled yet.
> >>>>>    From the Tarmac log we can see:
> >>>>>     3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret
> >>>> (parse_early_param)
> >>>>>     3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov
> >> x0,
> >>>> #0xc0   //      #192    (setup_arch)
> >>>>>                        R X0 (AARCH64) 00000000 000000c0
> >>>>>     3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr
> >>>> daif,   x0      (setup_arch)
> >>>>>                        R CPSR 600000c5
> >>>>>     3824884529 ps  ES  System Error (Abort)
> >>>>>                        EXC [0x380] SError/vSError Current EL with SP_ELx
> >>>>>                        R ESR_EL1 (AARCH64) bf000002
> >>>>>                        R CPSR 600003c5
> >>>>>                        R SPSR_EL1 (AARCH64) 600000c5
> >>>>>                        R ELR_EL1 (AARCH64) ffff8000 80763a68
> >>>>>     3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub
> >> sp,
> >>>> sp,     #0x150  (vectors)
> >>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>     3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add
> >> sp,
> >>>> sp,     x0      (vectors)
> >>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3d10
> >>>>>     3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub
> >> x0,
> >>>> sp,     x0      (vectors)
> >>>>>                        R X0 (AARCH64) ffff8000 808f3c50
> >>>>>     3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz
> >> w0,
> >>>> #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> >>>>>     3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub
> >> x0,
> >>>> sp,     x0      (vectors)
> >>>>>                        R X0 (AARCH64) 00000000 000000c0
> >>>>>     3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub
> sp,
> >>>> sp,     x0      (vectors)
> >>>>>                        R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>     3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b
> >>>> ffff800080011354        <el1h_64_error>         (vectors)
> >>>>>
> >>>>> If I understand correctly, the exception happened sometime earlier and
> >> only
> >>>> now Linux boot code (setup_arch) opened the exception handling and as
> a
> >>>> result we immediately jump to the SError exception handler.
> >>>>
> >>>>
> >>>> Yes, that sounds reasonable. If I understood correctly, you are
> >>>> running something "quite new" on some software (QEMU) and
> hardware
> >>>> (Synopsis) simulators.
> >>>>
> >>>> That would mean that you have new hardware with e.g. new memory
> map
> >>>> not used before. What you describe might sound like in the code before
> >>>> Linux (boot loader) there is anything resulting in the SError. This
> >>>> might be an access to non-existing or non-enabled hardware. I.e. it
> >>>> might be that you try to access (read/write) an address what is not
> >>>> available, yet (or just invalid). It's hard to debug that. In case you
> >>>> are able to modify the code before Linux (the boot loader?) you might
> >>>> try to enable SError exceptions, there, too. To get it earlier and
> >>>> with that make the search window smaller. I'm not that familiar with
> >>>> QEMU, but could you try to trace which (all?) hardware accesses your
> >>>> code does. And with that analyse all accesses and with that check if
> >>>> all these accesses are valid even on the hardware (Synopsis) emulation
> >>>> system? That should be checked from valid address and from hardware
> >>>> subsystem enablement point of view.
> >>>>
> >>>> Hth,
> >>>>
> >>>> Dirk
> >>>>
> >>>>
> >>>>>    From the Linux source:
> >>>>>         parse_early_param();
> >>>>>
> >>>>>         dynamic_scs_init();
> >>>>>
> >>>>>         /*
> >>>>>          * Unmask asynchronous aborts and fiq after bringing up possible
> >>>>>          * earlycon. (Report possible System Errors once we can report this
> >>>>>          * occurred).
> >>>>>          */
> >>>>>         local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we
> get
> >> the
> >>>> exception.
> >>>>>
> >>>>> After some kernel hacking (replacing printk) we could extract the logs:
> >>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> >>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> >>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld
> (GNU
> >>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> >>>>> 6Machine model: Pliops Spider MK-I EVK
> >>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> >>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>> pc : setup_arch+0x13c/0x5ac
> >>>>> lr : setup_arch+0x134/0x5ac
> >>>>> sp : ffff8000808f3da0
> >>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
> >>>> 0000000005e31b58c
> >>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
> >>>> ffff8000808f8000c
> >>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
> >> ffff800080010000c
> >>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
> 000000002266684ac
> >>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
> >>>> 0000000000000008c
> >>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
> 0000000000000003c
> >>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
> >> 0000000000000038c
> >>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
> >> 0000000000000001c
> >>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
> >>>> 0000000000000065c
> >>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
> >>>> 00000000000000c0c
> >>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> >>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>> Call trace:
> >>>>>     dump_backtrace+0x9c/0xd0
> >>>>>     show_stack+0x14/0x1c
> >>>>>     dump_stack_lvl+0x44/0x58
> >>>>>     dump_stack+0x14/0x1c
> >>>>>     panic+0x2e0/0x33c
> >>>>>     nmi_panic+0x68/0x6c
> >>>>>     arm64_serror_panic+0x68/0x78
> >>>>>     do_serror+0x24/0x54
> >>>>>     el1h_64_error_handler+0x2c/0x40
> >>>>>     el1h_64_error+0x64/0x68
> >>>>>     setup_arch+0x13c/0x5ac
> >>>>>     start_kernel+0x5c/0x5b8
> >>>>>     __primary_switched+0xb4/0xbc
> >>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >>>>>
> >>>>> Can you please advice how to proceed with debugging?
> >>>>>
> >>>>> Thanks in advanced,
> >>>>> Cheers,
> >>>>> Lior.
> >>>>>
> >>>>>
> >>>>
> >>>
> >


^ permalink raw reply

* Re: Debugging early SError exception
From: Dirk Behme @ 2023-12-21 11:19 UTC (permalink / raw)
  To: Lior Weintraub, linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB0555AA259A8E616B6C5BA823C395A@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Am 21.12.23 um 11:04 schrieb Lior Weintraub:
> Thanks Dirk,
> 
> Regarding the earlyprintk, not sure I know how to make it work.
> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my config but it doesn't seem to work.
> Do I need to pass something in the bootargs from the U-BOOT?
> Do I need to add that into my device tree?
> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" on my DT but it didn't work)

Yes, what has to be enabled, what not, and how things have to be set
is often confusing. I think this is not the same for all systems, so
to be on the safe side you have to look into the code for your
system. Or in short: the code is the documentation ;)
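
For what it is worth, as far as I know arm64 uses earlycon rather than earlyprintk; CONFIG_EARLY_PRINTK and CONFIG_DEBUG_LL are 32-bit ARM options. A sketch of what the bootargs could look like, assuming the 8250 early console code (CONFIG_SERIAL_8250_CONSOLE=y) matches your "snps,dw-apb-uart" and that the registers are 32 bits wide. The address is a placeholder you have to fill in from your memory map:

```text
# Option 1: explicit early console; <uart-phys-addr> is a placeholder
console=ttyS0,115200 earlycon=uart8250,mmio32,<uart-phys-addr>

# Option 2: bare "earlycon", which takes the UART from stdout-path
# under "chosen" in the device tree
console=ttyS0,115200 earlycon
```

Option 2 only works if stdout-path points at the UART node, so please check that in your DT first.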


> The UART I am using is "snps,dw-apb-uart".
> 
> Last week, to output the early logs I have implemented this hack:
> 1. Modify printk macro to run my print_func
> 2. This print_func wrote the characters into a single global variable (u32 simul_uart;)
> 3. Get the address location of this global variable and extract all writes to it from the Tarmac logs.
> 
> This is a very slow and tedious process but it helped me identify the initial SError.
> Initially I thought I can write directly into the UART FIFO register (which I know the address) but this didn't work because Linux already setup the MMU so I guess I need to know the virtual address of this FIFO.
> Do I need to use __phys_to_virt of some sort?

Yes, I think so. Have a look at the existing serial driver, too. It 
should do what's needed, and you can borrow from it.
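
As a rough illustration of the polling idea (and of the FIFO check that was missing in your putc_ll), here is a minimal sketch. It assumes a 16550-style register layout with one 32-bit word per register, which is what I would expect for the DW APB UART, but please check the offsets and bits against your databook. uart_putc_polled is a made-up name. In the kernel, regs would have to be an already mapped I/O address: device registers are normally mapped with ioremap() (or a fixmap slot this early in boot), which is what the serial driver does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16550-style register indices, one 32-bit word per register (assumed). */
#define UART_THR      0     /* transmit holding register (write) */
#define UART_LSR      5     /* line status register */
#define UART_LSR_THRE 0x20  /* transmit holding register empty */

/* Wait until the transmitter can take a character, then write it. */
void uart_putc_polled(volatile uint32_t *regs, char c)
{
    assert(regs != NULL);

    /* Never write while the transmitter is still busy. */
    while ((regs[UART_LSR] & UART_LSR_THRE) == 0)
        ;

    regs[UART_THR] = (uint32_t)(unsigned char)c;
}
```

The busy-wait is the whole point: without it the write can hit a full FIFO, which matches what you saw in Barebox.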

Best regards

Dirk


> Cheers,
> Lior.
> 
>> -----Original Message-----
>> From: Dirk Behme <dirk.behme@gmail.com>
>> Sent: Thursday, December 21, 2023 10:30 AM
>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>> Subject: Re: Debugging early SError exception
>>
>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
>>> Hi Dirk,
>>>
>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT
>> v2).
>>
>> Glad to hear that! :)
>>
>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
>> FIFO without checking if the FIFO is full.
>>> Once the fifo got full it caused this SError probably because the UART IP
>> generated an apberror signal.
>>
>> Thanks for the report!
>>
>>> Now the Linux is running and doesn't report the SError again but now we
>> face another issue.
>>> We see that the PC is getting into a "report_bug" function.
>>> The Linux doesn't print anything to the UART (probably since it hasn't got to
>> the point where the console is configured?).
>>
>> For cases like this using earlyprintk is usually a good option. Check
>> the Linux kernel serial console (UART) dirver of you SoC if it
>> supports it. In the end it should be "just" a function in the serial
>> console driver which outputs the console data via polling before
>> (later) the interrupt driven console part takes over.
>>
>> Best regards
>>
>> Dirk
>>
>>
>>> Since our debug means are limited it can take some time to find the root
>> cause.
>>>
>>> I will keep you posted and update our findings.
>>> Love to hear your thoughts,
>>>
>>> Cheers,
>>> Lior.
>>>
>>>
>>>> -----Original Message-----
>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>> Sent: Tuesday, December 19, 2023 3:37 PM
>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>> Subject: Re: Debugging early SError exception
>>>>
>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
>>>>> Thanks Dirk,
>>>>
>>>> Welcome :)
>>>>
>>>> In case you find the root cause it would be nice to get some generic
>>>> description of it so that we can learn something :)
>>>>
>>>> Best regards
>>>>
>>>> Dirk
>>>>
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
>> embedded@vger.kernel.org
>>>>>> Subject: Re: Debugging early SError exception
>>>>>>
>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
>>>>>>> This SoC is ARM64 (A53) single core based device.
>>>>>>> It runs correctly on QEMU but fails with SError on emulation platform
>>>>>> (Synopsys Zebu running our SoC model).
>>>>>>> There is no debugger connected to this emulation but there are several
>>>>>> debug capabilities we can use:
>>>>>>> 1. Generating wave dump of CPU signals
>>>>>>> 2. Generate a Tarmac log
>>>>>>> 3. UART
>>>>>>>
>>>>>>> Since the SError happens at early stages of Linux boot the UART is not
>>>>>> enabled yet.
>>>>>>>     From the Tarmac log we can see:
>>>>>>>      3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret
>>>>>> (parse_early_param)
>>>>>>>      3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov
>>>> x0,
>>>>>> #0xc0   //      #192    (setup_arch)
>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
>>>>>>>      3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr
>>>>>> daif,   x0      (setup_arch)
>>>>>>>                         R CPSR 600000c5
>>>>>>>      3824884529 ps  ES  System Error (Abort)
>>>>>>>                         EXC [0x380] SError/vSError Current EL with SP_ELx
>>>>>>>                         R ESR_EL1 (AARCH64) bf000002
>>>>>>>                         R CPSR 600003c5
>>>>>>>                         R SPSR_EL1 (AARCH64) 600000c5
>>>>>>>                         R ELR_EL1 (AARCH64) ffff8000 80763a68
>>>>>>>      3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub
>>>> sp,
>>>>>> sp,     #0x150  (vectors)
>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>      3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add
>>>> sp,
>>>>>> sp,     x0      (vectors)
>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3d10
>>>>>>>      3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub
>>>> x0,
>>>>>> sp,     x0      (vectors)
>>>>>>>                         R X0 (AARCH64) ffff8000 808f3c50
>>>>>>>      3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz
>>>> w0,
>>>>>> #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
>>>>>>>      3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub
>>>> x0,
>>>>>> sp,     x0      (vectors)
>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
>>>>>>>      3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub
>> sp,
>>>>>> sp,     x0      (vectors)
>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>      3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b
>>>>>> ffff800080011354        <el1h_64_error>         (vectors)
>>>>>>>
>>>>>>> If I understand correctly, the exception happened sometime earlier and
>>>> only
>>>>>> now Linux boot code (setup_arch) opened the exception handling and as
>> a
>>>>>> result we immediately jump to the SError exception handler.
>>>>>>
>>>>>>
>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
>>>>>> running something "quite new" on some software (QEMU) and
>> hardware
>>>>>> (Synopsis) simulators.
>>>>>>
>>>>>> That would mean that you have new hardware with e.g. new memory
>> map
>>>>>> not used before. What you describe might sound like in the code before
>>>>>> Linux (boot loader) there is anything resulting in the SError. This
>>>>>> might be an access to non-existing or non-enabled hardware. I.e. it
>>>>>> might be that you try to access (read/write) an address what is not
>>>>>> available, yet (or just invalid). It's hard to debug that. In case you
>>>>>> are able to modify the code before Linux (the boot loader?) you might
>>>>>> try to enable SError exceptions, there, too. To get it earlier and
>>>>>> with that make the search window smaller. I'm not that familiar with
>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
>>>>>> code does. And with that analyse all accesses and with that check if
>>>>>> all these accesses are valid even on the hardware (Synopsis) emulation
>>>>>> system? That should be checked from valid address and from hardware
>>>>>> subsystem enablement point of view.
>>>>>>
>>>>>> Hth,
>>>>>>
>>>>>> Dirk
>>>>>>
>>>>>>
>>>>>>>     From the Linux source:
>>>>>>>          parse_early_param();
>>>>>>>
>>>>>>>          dynamic_scs_init();
>>>>>>>
>>>>>>>          /*
>>>>>>>           * Unmask asynchronous aborts and fiq after bringing up possible
>>>>>>>           * earlycon. (Report possible System Errors once we can report this
>>>>>>>           * occurred).
>>>>>>>           */
>>>>>>>          local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we
>> get
>>>> the
>>>>>> exception.
>>>>>>>
>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld
>> (GNU
>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
>>>>>>> 6Machine model: Pliops Spider MK-I EVK
>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>> pc : setup_arch+0x13c/0x5ac
>>>>>>> lr : setup_arch+0x134/0x5ac
>>>>>>> sp : ffff8000808f3da0
>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
>>>>>> 0000000005e31b58c
>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
>>>>>> ffff8000808f8000c
>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
>>>> ffff800080010000c
>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
>> 000000002266684ac
>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
>>>>>> 0000000000000008c
>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
>> 0000000000000003c
>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
>>>> 0000000000000038c
>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
>>>> 0000000000000001c
>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
>>>>>> 0000000000000065c
>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
>>>>>> 00000000000000c0c
>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>> Call trace:
>>>>>>>      dump_backtrace+0x9c/0xd0
>>>>>>>      show_stack+0x14/0x1c
>>>>>>>      dump_stack_lvl+0x44/0x58
>>>>>>>      dump_stack+0x14/0x1c
>>>>>>>      panic+0x2e0/0x33c
>>>>>>>      nmi_panic+0x68/0x6c
>>>>>>>      arm64_serror_panic+0x68/0x78
>>>>>>>      do_serror+0x24/0x54
>>>>>>>      el1h_64_error_handler+0x2c/0x40
>>>>>>>      el1h_64_error+0x64/0x68
>>>>>>>      setup_arch+0x13c/0x5ac
>>>>>>>      start_kernel+0x5c/0x5b8
>>>>>>>      __primary_switched+0xb4/0xbc
>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>>>>>>>
>>>>>>> Can you please advice how to proceed with debugging?
>>>>>>>
>>>>>>> Thanks in advanced,
>>>>>>> Cheers,
>>>>>>> Lior.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
> 


^ permalink raw reply

* Re: Debugging early SError exception
From: Heiko Schocher @ 2023-12-21 11:36 UTC (permalink / raw)
  To: Lior Weintraub; +Cc: Dirk Behme, linux-embedded@vger.kernel.org
In-Reply-To: <8140c4c7-10d5-46dc-8c32-8bee7bf95918@gmail.com>

Hi Lior,

On 21.12.23 12:19, Dirk Behme wrote:
> Am 21.12.23 um 11:04 schrieb Lior Weintraub:
>> Thanks Dirk,
>>
>> Regarding the earlyprintk, not sure I know how to make it work.
>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my config but it doesn't seem to work.
>> Do I need to pass something in the bootargs from the U-BOOT?
>> Do I need to add that into my device tree?
>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" on my DT but it didn't
>> work)
> 
> Yes, what has to be enabled and what not and what has to be set how is often confusing. I think this
> is not common for all systems, so I think to be on the safe side you have to look into the code for
> you system. Or short; The code is the documentation ;)
> 
> 
>> The UART I am using is "snps,dw-apb-uart".
>>
>> Last week, to output the early logs I have implemented this hack:
>> 1. Modify printk macro to run my print_func
>> 2. This print_func wrote the characters into a single global variable (u32 simul_uart;)
>> 3. Get the address location of this global variable and extract all writes to it from the Tarmac
>> logs.
>>
>> This is a very slow and tedious process but it helped me identify the initial SError.
>> Initially I thought I can write directly into the UART FIFO register (which I know the address)
>> but this didn't work because Linux already setup the MMU so I guess I need to know the virtual
>> address of this FIFO.
>> Do I need to use __phys_to_virt of some sort?
> 
> Yes, I think so. Have a look to the existing serial driver, too. It should do whats needed, and you
> can borrow that, then.

If you have access to the RAM after the crash (through a debugger or in
your bootloader) and your memory is stable, look up the address of __log_buf
in System.map. That's the buffer printk writes into, so dumping its
content shows you what you would see if the UART worked...
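
A small sketch of the lookup step, assuming the usual "<hex address> <type> <symbol>" line format of System.map. symbol_address is a made-up helper, and the address in the usage note below is invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one System.map line of the form "<hex address> <type> <symbol>".
 * Returns 1 and stores the address in *addr if the symbol matches,
 * otherwise returns 0 and leaves *addr untouched.
 */
int symbol_address(const char *line, const char *symbol, uint64_t *addr)
{
    unsigned long long value;
    char type;
    char name[128];

    assert(line != NULL && symbol != NULL && addr != NULL);

    if (sscanf(line, "%llx %c %127s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, symbol) != 0)
        return 0;

    *addr = (uint64_t)value;
    return 1;
}
```

Feeding it a line such as "ffff800080a1e000 b __log_buf" stores 0xffff800080a1e000 in *addr. Note that this is a kernel virtual address, so for a dump from the bootloader you first have to translate it to the matching physical address.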

Hope it helps!

bye,
Heiko
> 
> Best regards
> 
> Dirk
> 
> 
>> Cheers,
>> Lior.
>>
>>> -----Original Message-----
>>> From: Dirk Behme <dirk.behme@gmail.com>
>>> Sent: Thursday, December 21, 2023 10:30 AM
>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>> Subject: Re: Debugging early SError exception
>>>
>>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
>>>> Hi Dirk,
>>>>
>>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT
>>> v2).
>>>
>>> Glad to hear that! :)
>>>
>>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
>>> FIFO without checking if the FIFO is full.
>>>> Once the fifo got full it caused this SError probably because the UART IP
>>> generated an apberror signal.
>>>
>>> Thanks for the report!
>>>
>>>> Now the Linux is running and doesn't report the SError again but now we
>>> face another issue.
>>>> We see that the PC is getting into a "report_bug" function.
>>>> The Linux doesn't print anything to the UART (probably since it hasn't got to
>>> the point where the console is configured?).
>>>
>>> For cases like this using earlyprintk is usually a good option. Check
>>> the Linux kernel serial console (UART) dirver of you SoC if it
>>> supports it. In the end it should be "just" a function in the serial
>>> console driver which outputs the console data via polling before
>>> (later) the interrupt driven console part takes over.
>>>
>>> Best regards
>>>
>>> Dirk
>>>
>>>
>>>> Since our debug means are limited it can take some time to find the root
>>>> cause.
>>>>
>>>> I will keep you posted and update our findings.
>>>> Love to hear your thoughts,
>>>>
>>>> Cheers,
>>>> Lior.
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>> Subject: Re: Debugging early SError exception
>>>>>
>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
>>>>>> Thanks Dirk,
>>>>>
>>>>> Welcome :)
>>>>>
>>>>> In case you find the root cause it would be nice to get some generic
>>>>> description of it so that we can learn something :)
>>>>>
>>>>> Best regards
>>>>>
>>>>> Dirk
>>>>>
>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>>> Subject: Re: Debugging early SError exception
>>>>>>>
>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
>>>>>>>> This SoC is ARM64 (A53) single core based device.
>>>>>>>> It runs correctly on QEMU but fails with SError on emulation platform
>>>>>>>> (Synopsys Zebu running our SoC model).
>>>>>>>> There is no debugger connected to this emulation but there are several
>>>>>>>> debug capabilities we can use:
>>>>>>>> 1. Generating wave dump of CPU signals
>>>>>>>> 2. Generate a Tarmac log
>>>>>>>> 3. UART
>>>>>>>>
>>>>>>>> Since the SError happens at early stages of Linux boot the UART is not
>>>>>>>> enabled yet.
>>>>>>>>     From the Tarmac log we can see:
>>>>>>>>      3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret     (parse_early_param)
>>>>>>>>      3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov     x0, #0xc0   //      #192    (setup_arch)
>>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
>>>>>>>>      3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr     daif,   x0      (setup_arch)
>>>>>>>>                         R CPSR 600000c5
>>>>>>>>      3824884529 ps  ES  System Error (Abort)
>>>>>>>>                         EXC [0x380] SError/vSError Current EL with SP_ELx
>>>>>>>>                         R ESR_EL1 (AARCH64) bf000002
>>>>>>>>                         R CPSR 600003c5
>>>>>>>>                         R SPSR_EL1 (AARCH64) 600000c5
>>>>>>>>                         R ELR_EL1 (AARCH64) ffff8000 80763a68
>>>>>>>>      3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub     sp, sp,     #0x150  (vectors)
>>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>      3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add     sp, sp,     x0      (vectors)
>>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3d10
>>>>>>>>      3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
>>>>>>>>                         R X0 (AARCH64) ffff8000 808f3c50
>>>>>>>>      3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz    w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
>>>>>>>>      3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
>>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
>>>>>>>>      3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub     sp, sp,     x0      (vectors)
>>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>      3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b       ffff800080011354        <el1h_64_error>         (vectors)
>>>>>>>>
>>>>>>>> If I understand correctly, the exception happened sometime earlier and
>>>>>>>> only now Linux boot code (setup_arch) opened the exception handling and
>>>>>>>> as a result we immediately jump to the SError exception handler.
>>>>>>>
>>>>>>>
>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
>>>>>>> running something "quite new" on a software (QEMU) and a hardware
>>>>>>> (Synopsys) simulator.
>>>>>>>
>>>>>>> That would mean that you have new hardware with e.g. a new memory
>>>>>>> map not used before. What you describe sounds like something in the
>>>>>>> code before Linux (the boot loader) is causing the SError. This
>>>>>>> might be an access to non-existing or non-enabled hardware, i.e. you
>>>>>>> might be trying to read or write an address that is not available
>>>>>>> yet (or is just invalid). That is hard to debug. If you are able to
>>>>>>> modify the code before Linux (the boot loader?), you might try to
>>>>>>> enable SError exceptions there, too, to get the exception earlier
>>>>>>> and so make the search window smaller. I'm not that familiar with
>>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
>>>>>>> code does, and then check whether all of these accesses are also
>>>>>>> valid on the hardware (Synopsys) emulation system? That should be
>>>>>>> checked both for valid addresses and for hardware subsystem
>>>>>>> enablement.
>>>>>>>
>>>>>>> Hth,
>>>>>>>
>>>>>>> Dirk
>>>>>>>
>>>>>>>
>>>>>>>>     From the Linux source:
>>>>>>>>          parse_early_param();
>>>>>>>>
>>>>>>>>          dynamic_scs_init();
>>>>>>>>
>>>>>>>>          /*
>>>>>>>>           * Unmask asynchronous aborts and fiq after bringing up possible
>>>>>>>>           * earlycon. (Report possible System Errors once we can report this
>>>>>>>>           * occurred).
>>>>>>>>           */
>>>>>>>>          local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
>>>>>>>>
>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>> pc : setup_arch+0x13c/0x5ac
>>>>>>>> lr : setup_arch+0x134/0x5ac
>>>>>>>> sp : ffff8000808f3da0
>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>> Call trace:
>>>>>>>>      dump_backtrace+0x9c/0xd0
>>>>>>>>      show_stack+0x14/0x1c
>>>>>>>>      dump_stack_lvl+0x44/0x58
>>>>>>>>      dump_stack+0x14/0x1c
>>>>>>>>      panic+0x2e0/0x33c
>>>>>>>>      nmi_panic+0x68/0x6c
>>>>>>>>      arm64_serror_panic+0x68/0x78
>>>>>>>>      do_serror+0x24/0x54
>>>>>>>>      el1h_64_error_handler+0x2c/0x40
>>>>>>>>      el1h_64_error+0x64/0x68
>>>>>>>>      setup_arch+0x13c/0x5ac
>>>>>>>>      start_kernel+0x5c/0x5b8
>>>>>>>>      __primary_switched+0xb4/0xbc
>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>>>>>>>>
>>>>>>>> Can you please advise how to proceed with debugging?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Cheers,
>>>>>>>> Lior.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
> 

-- 
DENX Software Engineering GmbH,      Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de

^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-21 12:04 UTC (permalink / raw)
  To: hs@denx.de; +Cc: Dirk Behme, linux-embedded@vger.kernel.org
In-Reply-To: <a288e1c4-8637-34bc-b6a3-c9aa3edb22e6@denx.de>

Thanks Heiko,
Will do that.
Cheers,
Lior.


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-22  7:03 UTC (permalink / raw)
  To: hs@denx.de, Dirk Behme; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <a288e1c4-8637-34bc-b6a3-c9aa3edb22e6@denx.de>

Hi,

I managed to dump the __log_buf but for some reason the UART is still not working.
Please note that the UART printed all the U-BOOT traces so AFAIU, the device tree is set correctly
(Barebox is passing its DTB to the kernel).

To enable the earlyprintk I have:
1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
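
One thing that may be worth double-checking here: as far as I can tell from
the kernel-parameters documentation, the 8250 early console is selected with
"earlycon=uart8250,<iotype>,<addr>[,options]" (or "uart" in place of
"uart8250"). In the OF_EARLYCON_DECLARE line above, the first argument (uart)
is the early console name, while "snps,dw-apb-uart" is the device tree
compatible string. If that reading is right, "earlycon=dw-apb-uart,..." would
not match any early console by name. The sketch below only checks a value
against that documented form; the helper name and the accepted sets are
assumptions for illustration, and mmio32 is a guess for this UART's register
width.

```python
# Names and I/O types taken from the "uart[8250],<iotype>,<addr>" forms in the
# kernel-parameters documentation; treat both sets as assumptions.
KNOWN_NAMES = {"uart", "uart8250"}
KNOWN_IOTYPES = {"io", "mmio", "mmio16", "mmio32", "mmio32be"}


def check_earlycon(value):
    """Return (ok, reason) for the text that follows 'earlycon='."""
    fields = value.split(",")
    name = fields[0]
    if name not in KNOWN_NAMES:
        return False, "unknown early console name: " + name
    if len(fields) < 2:
        return False, "missing address"
    # The I/O type is optional in the "uart[8250],0x<addr>" form.
    has_iotype = fields[1] in KNOWN_IOTYPES and len(fields) > 2
    addr_field = fields[2] if has_iotype else fields[1]
    try:
        int(addr_field, 16)
    except ValueError:
        return False, "bad address: " + addr_field
    return True, "ok"


print(check_earlycon("dw-apb-uart,0xd000307000"))
print(check_earlycon("uart8250,mmio32,0xd000307000"))
```

A bare "earlycon" with no value is a different case: the documentation says
the console is then taken from stdout-path in the /chosen node, so that path
is not covered by this check.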

From __log_buf dump:
Booting Linux on physical CPU 0x0000000000 [0x410fd034]
Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 2023
Machine model: Pliops Spider MK-I EVK
efi: UEFI not found.
Zone ranges:
  DMA      [mem 0x0000000000000000-0x000000002fffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000000002fffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
pcpu-alloc: [0] 0
Detected VIPT I-cache on CPU0
CPU features: GIC system register CPU interface present but disabled by higher exception level
CPU features: detected: ARM erratum 845719
alternatives: applying boot alternatives
Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
Built 1 zonelists, mobility grouping on.  Total pages: 193536
mem auto-init: stack:off, heap alloc:off, heap free:off
software IO TLB: area num 1.
software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)
Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
trace event string verifier disabled
rcu: Hierarchical RCU implementation.
rcu: 	RCU event tracing is enabled.
rcu: 	RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
GICv3: 96 SPIs implemented
GICv3: 0 Extended SPIs implemented
Root IRQ handler: gic_handle_irq
GICv3: GICv3 features: 16 PPIs
GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
GICv3: redistributor failed to wakeup...
GICv3: GIC: unable to set SRE (disabled at EL2), panic ahead
Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
Hardware name: Pliops Spider MK-I EVK (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : gic_cpu_sys_reg_init+0x58/0x2e4
lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
sp : ffff8000808f3b40
x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
Call trace:
 gic_cpu_sys_reg_init+0x58/0x2e4
 gic_cpu_init.part.0+0xa8/0x114
 gic_init_bases+0x408/0x684
 gic_of_init+0x298/0x300
 of_irq_init+0x1c8/0x368
 irqchip_init+0x14/0x1c
 init_IRQ+0x98/0xac
 start_kernel+0x250/0x5b8
 __primary_switched+0xb4/0xbc
Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---


The kernel panic is related to the GIC distributor (currently under debug) but AFAIU,
this has nothing to do with the UART not working in the early stages.
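
The "Code:" line in the oops can be decoded by hand to see which system
register access trapped. A minimal sketch, assuming the standard A64 MRS
encoding; the function name and the two-entry register table are
illustrative only. The word in parentheses, d538cc80, decodes to a read of
S3_0_C12_C12_4 (ICC_CTLR_EL1) and d538cca0 to a read of ICC_SRE_EL1, which
seems consistent with the "unable to set SRE (disabled at EL2)" message.

```python
# (op0, op1, CRn, CRm, op2) -> name; only the two GIC CPU interface registers
# relevant to this trace are listed.
SYSREG_NAMES = {
    (3, 0, 12, 12, 4): "ICC_CTLR_EL1",
    (3, 0, 12, 12, 5): "ICC_SRE_EL1",
}


def decode_mrs(word):
    """Decode an A64 'MRS Xt, <sysreg>' word; return None for anything else."""
    # Bits [31:20] are 0xD53 for MRS (system register read).
    if (word >> 20) != 0xD53:
        return None
    op0 = 2 + ((word >> 19) & 0x1)
    op1 = (word >> 16) & 0x7
    crn = (word >> 12) & 0xF
    crm = (word >> 8) & 0xF
    op2 = (word >> 5) & 0x7
    rt = word & 0x1F
    key = (op0, op1, crn, crm, op2)
    name = SYSREG_NAMES.get(key, "S%d_%d_C%d_C%d_%d" % key)
    return "mrs x%d, %s" % (rt, name)


# The five words from the Code: line; the last one is the faulting instruction.
for w in (0x9260DF39, 0xD3441F33, 0xD538CCA0, 0x36001180, 0xD538CC80):
    print("%08x" % w, decode_mrs(w))
```

Words that are not MRS instructions are reported as None rather than
disassembled; a full disassembler (objdump on vmlinux) is the better tool for
anything beyond this quick check.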

Thanks in advance for your advice,
Cheers,
Lior.
 


> -----Original Message-----
> From: Heiko Schocher <hs@denx.de>
> Sent: Thursday, December 21, 2023 1:37 PM
> To: Lior Weintraub <liorw@pliops.com>
> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-embedded@vger.kernel.org
> Subject: Re: Debugging early SError exception
> 
> Hi Lior,
> 
> On 21.12.23 12:19, Dirk Behme wrote:
> > Am 21.12.23 um 11:04 schrieb Lior Weintraub:
> >> Thanks Dirk,
> >>
> >> Regarding the earlyprintk, not sure I know how to make it work.
> >> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my
> >> config but it doesn't seem to work.
> >> Do I need to pass something in the bootargs from the U-BOOT?
> >> Do I need to add that into my device tree?
> >> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under
> >> "chosen" on my DT but it didn't work)
> >
> > Yes, what has to be enabled and what not, and what has to be set how, is
> > often confusing. I think this is not common for all systems, so to be on
> > the safe side you have to look into the code for your system. Or in
> > short: the code is the documentation ;)
> >
> >
> >> The UART I am using is "snps,dw-apb-uart".
> >>
> >> Last week, to output the early logs I have implemented this hack:
> >> 1. Modify printk macro to run my print_func
> >> 2. This print_func wrote the characters into a single global variable
> >>    (u32 simul_uart;)
> >> 3. Get the address location of this global variable and extract all
> >>    writes to it from the Tarmac logs.
> >>
> >> This is a very slow and tedious process but it helped me identify the
> >> initial SError.
> >> Initially I thought I can write directly into the UART FIFO register
> >> (which I know the address) but this didn't work because Linux already
> >> setup the MMU so I guess I need to know the virtual address of this FIFO.
> >> Do I need to use __phys_to_virt of some sort?
> >
> > Yes, I think so. Have a look at the existing serial driver, too. It should
> > do what's needed, and you can borrow that, then.
>
> If you have access to the RAM after the crash (through a debugger or in
> your bootloader) and your mem is stable, find out the address of __log_buf
> in System.map. That's the buffer that printk writes into, so dumping
> the content is what you would see in case uart works...
> 
> Hope it helps!
> 
> bye,
> Heiko
> >
> > Best regards
> >
> > Dirk
> >
> >
> >> Cheers,
> >> Lior.
> >>
> >>> -----Original Message-----
> >>> From: Dirk Behme <dirk.behme@gmail.com>
> >>> Sent: Thursday, December 21, 2023 10:30 AM
> >>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>> Subject: Re: Debugging early SError exception
> >>>
> >>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> >>>> Hi Dirk,
> >>>>
> >>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT
> >>>> v2).
> >>>
> >>> Glad to hear that! :)
> >>>
> >>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
> >>>> FIFO without checking if the FIFO is full.
> >>>> Once the fifo got full it caused this SError probably because the UART IP
> >>>> generated an apberror signal.
> >>>
> >>> Thanks for the report!
> >>>
> >>>> Now the Linux is running and doesn't report the SError again but now we
> >>>> face another issue.
> >>>> We see that the PC is getting into a "report_bug" function.
> >>>> The Linux doesn't print anything to the UART (probably since it hasn't got to
> >>>> the point where the console is configured?).
> >>>
> >>> For cases like this using earlyprintk is usually a good option. Check
> >>> whether the Linux kernel serial console (UART) driver of your SoC
> >>> supports it. In the end it should be "just" a function in the serial
> >>> console driver which outputs the console data via polling before
> >>> (later) the interrupt driven console part takes over.
> >>>
> >>> Best regards
> >>>
> >>> Dirk
> >>>
> >>>
> >>>> Since our debug means are limited it can take some time to find the root
> >>>> cause.
> >>>>
> >>>> I will keep you posted and update our findings.
> >>>> Love to hear your thoughts,
> >>>>
> >>>> Cheers,
> >>>> Lior.
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> >>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>>>> Subject: Re: Debugging early SError exception
> >>>>>
> >>>>> On 19.12.23 at 14:23, Lior Weintraub wrote:
> >>>>>> Thanks Dirk,
> >>>>>
> >>>>> Welcome :)
> >>>>>
> >>>>> In case you find the root cause it would be nice to get some generic
> >>>>> description of it so that we can learn something :)
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Dirk
> >>>>>
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> >>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>>>>>> Subject: Re: Debugging early SError exception
> >>>>>>>
> >>>>>>> On 17.12.23 at 22:32, Lior Weintraub wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We are porting embedded Linux (kernel v6.5) to a new SoC.
> >>>>>>>> This SoC is a single-core ARM64 (Cortex-A53) device.
> >>>>>>>> It runs correctly on QEMU but fails with an SError on our emulation
> >>>>>>>> platform (Synopsys ZeBu running our SoC model).
> >>>>>>>> There is no debugger connected to this emulation, but there are several
> >>>>>>>> debug capabilities we can use:
> >>>>>>>> 1. Generating a wave dump of CPU signals
> >>>>>>>> 2. Generating a Tarmac log
> >>>>>>>> 3. UART
> >>>>>>>>
> >>>>>>>> Since the SError happens at an early stage of the Linux boot, the UART is
> >>>>>>>> not enabled yet.
> >>>>>>>> From the Tarmac log we can see:
> >>>>>>>>      3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret     (parse_early_param)
> >>>>>>>>      3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov     x0, #0xc0   //      #192    (setup_arch)
> >>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>      3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr     daif,   x0      (setup_arch)
> >>>>>>>>                         R CPSR 600000c5
> >>>>>>>>      3824884529 ps  ES  System Error (Abort)
> >>>>>>>>                         EXC [0x380] SError/vSError Current EL with SP_ELx
> >>>>>>>>                         R ESR_EL1 (AARCH64) bf000002
> >>>>>>>>                         R CPSR 600003c5
> >>>>>>>>                         R SPSR_EL1 (AARCH64) 600000c5
> >>>>>>>>                         R ELR_EL1 (AARCH64) ffff8000 80763a68
> >>>>>>>>      3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub     sp, sp,     #0x150  (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add     sp, sp,     x0      (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3d10
> >>>>>>>>      3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
> >>>>>>>>                         R X0 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz    w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> >>>>>>>>      3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub     x0, sp,     x0      (vectors)
> >>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>      3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub     sp, sp,     x0      (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b       ffff800080011354        <el1h_64_error>         (vectors)
> >>>>>>>>
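The ESR_EL1 value in the log above (bf000002) can be split into its
architectural fields. A minimal sketch in C: the helper names are made up
for illustration, the field positions follow the Armv8-A ESR_ELx layout
(EC in bits [31:26], IL in bit 25, ISS in bits [24:0]; for an SError,
ISS bit 24 is the IDS flag).

```c
#include <stdint.h>

#define ESR_EC_SERROR 0x2fu  /* exception class: SError interrupt */

/* Exception class, bits [31:26]. */
static inline uint32_t esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3fu; }

/* Instruction length bit, bit 25. */
static inline uint32_t esr_il(uint32_t esr)  { return (esr >> 25) & 0x1u; }

/* Instruction specific syndrome, bits [24:0]. */
static inline uint32_t esr_iss(uint32_t esr) { return esr & 0x01ffffffu; }

/* For EC == SError only: 1 means ISS[23:0] is implementation defined. */
static inline uint32_t esr_serror_ids(uint32_t esr) { return (esr >> 24) & 0x1u; }
```

For bf000002 this gives EC = 0x2f (SError), IL = 1 and IDS = 1, so the
remaining syndrome bits (0x000002 here) are implementation defined and need
the core's Technical Reference Manual to interpret.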
> >>>>>>>> If I understand correctly, the exception happened sometime earlier, and
> >>>>>>>> only now has the Linux boot code (setup_arch) unmasked exception
> >>>>>>>> handling; as a result we immediately jump to the SError exception
> >>>>>>>> handler.
> >>>>>>>
> >>>>>>>
> >>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> >>>>>>> running something "quite new" on software (QEMU) and hardware
> >>>>>>> (Synopsys) simulators.
> >>>>>>>
> >>>>>>> That would mean that you have new hardware with e.g. a new memory map
> >>>>>>> not used before. What you describe sounds like something in the code
> >>>>>>> before Linux (the boot loader) is causing the SError. This
> >>>>>>> might be an access to non-existent or not-yet-enabled hardware, i.e. you
> >>>>>>> might be trying to access (read/write) an address that is not
> >>>>>>> available yet (or is simply invalid). That is hard to debug. If you
> >>>>>>> are able to modify the code before Linux (the boot loader?), you might
> >>>>>>> try to enable SError exceptions there, too, to get the exception earlier
> >>>>>>> and so make the search window smaller. I'm not that familiar with
> >>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
> >>>>>>> code makes, and then check whether all of these accesses are also
> >>>>>>> valid on the hardware (Synopsys) emulation system? That should be
> >>>>>>> checked both for valid addresses and for hardware subsystem
> >>>>>>> enablement.
> >>>>>>>
> >>>>>>> Hth,
> >>>>>>>
> >>>>>>> Dirk
> >>>>>>>
> >>>>>>>
> >>>>>>>>     From the Linux source:
> >>>>>>>>          parse_early_param();
> >>>>>>>>
> >>>>>>>>          dynamic_scs_init();
> >>>>>>>>
> >>>>>>>>          /*
> >>>>>>>>           * Unmask asynchronous aborts and fiq after bringing up possible
> >>>>>>>>           * earlycon. (Report possible System Errors once we can report this
> >>>>>>>>           * occurred).
> >>>>>>>>           */
> >>>>>>>>          local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
> >>>>>>>>
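To make the masking visible: judging by the 0xc0 written to daif in the
Tarmac log, DAIF_PROCCTX_NOIRQ keeps IRQ and FIQ masked but unmasks debug
exceptions and asynchronous aborts (SError), so a pending SError is taken
right after that write. A small C sketch that renders the D/A/I/F bits of a
PSTATE value the way the kernel register dump does (upper case = masked);
the function name is made up for illustration, the bit positions are the
architectural PSTATE ones.

```c
#include <stdint.h>

#define PSR_F_BIT 0x040u  /* FIQ mask */
#define PSR_I_BIT 0x080u  /* IRQ mask */
#define PSR_A_BIT 0x100u  /* SError (asynchronous abort) mask */
#define PSR_D_BIT 0x200u  /* debug exception mask */

/* Write e.g. "daIF" into out: upper case means the exception is masked. */
static void daif_to_string(uint32_t pstate, char out[5])
{
    out[0] = (pstate & PSR_D_BIT) ? 'D' : 'd';
    out[1] = (pstate & PSR_A_BIT) ? 'A' : 'a';
    out[2] = (pstate & PSR_I_BIT) ? 'I' : 'i';
    out[3] = (pstate & PSR_F_BIT) ? 'F' : 'f';
    out[4] = '\0';
}
```

For the values in the log, 600000c5 (right after the msr) renders as "daIF",
and 600003c5 (on entry to the SError vector) renders as "DAIF".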
> >>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
> >>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> >>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> >>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> >>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>>>>> pc : setup_arch+0x13c/0x5ac
> >>>>>>>> lr : setup_arch+0x134/0x5ac
> >>>>>>>> sp : ffff8000808f3da0
> >>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
> >>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
> >>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
> >>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
> >>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
> >>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
> >>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
> >>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
> >>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
> >>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
> >>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>> Call trace:
> >>>>>>>>      dump_backtrace+0x9c/0xd0
> >>>>>>>>      show_stack+0x14/0x1c
> >>>>>>>>      dump_stack_lvl+0x44/0x58
> >>>>>>>>      dump_stack+0x14/0x1c
> >>>>>>>>      panic+0x2e0/0x33c
> >>>>>>>>      nmi_panic+0x68/0x6c
> >>>>>>>>      arm64_serror_panic+0x68/0x78
> >>>>>>>>      do_serror+0x24/0x54
> >>>>>>>>      el1h_64_error_handler+0x2c/0x40
> >>>>>>>>      el1h_64_error+0x64/0x68
> >>>>>>>>      setup_arch+0x13c/0x5ac
> >>>>>>>>      start_kernel+0x5c/0x5b8
> >>>>>>>>      __primary_switched+0xb4/0xbc
> >>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >>>>>>>>
> >>>>>>>> Can you please advise how to proceed with debugging?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>> Cheers,
> >>>>>>>> Lior.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
> 
> --
> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de

^ permalink raw reply

