Embedded Linux development
* RE: elinux.org wiki style
From: Bird, Tim @ 2024-10-30  0:01 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux Embedded
In-Reply-To: <CAMuHMdXN3J=_xFQkuhcLxhniNK0LqTH9rFM9Ydex1TUufuBw3w@mail.gmail.com>



> -----Original Message-----
> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Hi Tim,
> 
> https://elinux.org/ is looking weird today.
> 
> There seems to be something wrong with the stylesheet.
> When trying to open it:
> 
> [b554204955cf12a33d7baee1]
> /load.php?debug=false&lang=en&modules=ext.echo.badgeicons%7Cext.echo.styles.badge%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles&only=styles&skin=vector
> Error from line 689 of
> /var/www/elinux.org/htdocs/includes/exception/MWExceptionHandler.php:
> Class 'FormatJson' not found
> 
> Backtrace:
> 
> #0 /var/www/elinux.org/htdocs/includes/exception/MWExceptionHandler.php(216):
> MWExceptionHandler::logError(ErrorException, string, string)
> #1 /var/www/elinux.org/htdocs/includes/AutoLoader.php(109):
> MWExceptionHandler::handleError(integer, string, string, integer,
> array)
> #2 /var/www/elinux.org/htdocs/includes/AutoLoader.php(109): require()
> #3 [internal function]: AutoLoader::autoload(string)
> #4 /var/www/elinux.org/htdocs/includes/resourceloader/ResourceLoader.php(141):
> spl_autoload_call(string)
> #5 /var/www/elinux.org/htdocs/includes/resourceloader/ResourceLoader.php(751):
> ResourceLoader->preloadModuleInfo(array, ResourceLoaderContext)
> #6 /var/www/elinux.org/htdocs/load.php(51):
> ResourceLoader->respond(ResourceLoaderContext)
> #7 {main}
> 

Yes. I saw the format error yesterday, and asked Bill Traynor to look into it.
Thanks for the details and backtrace.  I'll send those on to Bill.
 -- Tim



* elinux.org wiki style
From: Geert Uytterhoeven @ 2024-10-29 13:17 UTC (permalink / raw)
  To: Bird, Timothy; +Cc: Linux Embedded

Hi Tim,

https://elinux.org/ is looking weird today.

There seems to be something wrong with the stylesheet.
When trying to open it:

[b554204955cf12a33d7baee1]
/load.php?debug=false&lang=en&modules=ext.echo.badgeicons%7Cext.echo.styles.badge%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles&only=styles&skin=vector
Error from line 689 of
/var/www/elinux.org/htdocs/includes/exception/MWExceptionHandler.php:
Class 'FormatJson' not found

Backtrace:

#0 /var/www/elinux.org/htdocs/includes/exception/MWExceptionHandler.php(216):
MWExceptionHandler::logError(ErrorException, string, string)
#1 /var/www/elinux.org/htdocs/includes/AutoLoader.php(109):
MWExceptionHandler::handleError(integer, string, string, integer,
array)
#2 /var/www/elinux.org/htdocs/includes/AutoLoader.php(109): require()
#3 [internal function]: AutoLoader::autoload(string)
#4 /var/www/elinux.org/htdocs/includes/resourceloader/ResourceLoader.php(141):
spl_autoload_call(string)
#5 /var/www/elinux.org/htdocs/includes/resourceloader/ResourceLoader.php(751):
ResourceLoader->preloadModuleInfo(array, ResourceLoaderContext)
#6 /var/www/elinux.org/htdocs/load.php(51):
ResourceLoader->respond(ResourceLoaderContext)
#7 {main}

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: Boot-time initiative (SIG) thoughts and next steps
From: Saravana Kannan @ 2024-10-28 22:33 UTC (permalink / raw)
  To: Bird, Tim; +Cc: linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <MW5PR13MB5632E4EFFD802E0839027A51FD4A2@MW5PR13MB5632.namprd13.prod.outlook.com>

On Sun, Oct 27, 2024 at 6:30 PM Bird, Tim <Tim.Bird@sony.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Saravana Kannan <saravanak@google.com>
> > On Fri, Oct 25, 2024 at 11:18 AM Bird, Tim <Tim.Bird@sony.com> wrote:
> > >
> > > Hey Linux developers,
> > >
> > > The response to my request to form a Special Interest Group for boot-time reduction
> > > for Linux has been really great.  Many people contacted me by e-mail and on LinkedIn.
> >
> > Hi Tim,
> >
> > Thanks for organizing this and moving it forward! I'd be interested in
> > contributing to this effort as a lot of work I have done aligns with
> > the goals of this effort and boot time is of obvious value to Android.
>
> Thanks for your interest.  I would love to have developers from Google,
> and from the Android community, involved.
>
> >
> > > I had hoped to push out a script today to start to gather data on boot-time on different
> > > platforms, for people to run who had expressed interest in helping with this effort. But
> > > I got overwhelmed with other tasks, and I may not get it done today.  I'll be in Tokyo next
> > > week for Open Source Summit Japan.  If you are there, please try to catch me and say hi.
> > > Given that, I'll see how soon I can provide the script I'm talking about, and we can
> > > discuss the goals and design of the script.
> > >
> > > A couple of quick things:
> > > There are lots of things to discuss, but here are a few things to get started with...
> > >
> > > = wiki account =
> > > The wiki where we'll be maintaining information about
> > > boot time, and about activities of the boot time SIG, is the elinux wiki.
> > > The page we'll be focusing on is: https://elinux.org/Boot_Time.
> > > If you are interested in helping update and maintain the information there
> > > (which I hope almost everyone is), then please make sure you have a user
> > > account on the wiki.
> > > If you don't have one, please go here:
> > > https://elinux.org/Special:RequestAccount
> > > I have to manually approve accounts in order to fight spambots.  It might
> > > take a few days for me to get to your request.  It's very helpful if you
> > > put a comment in one of the request fields about this being related to
> > > the boot-time initiative or SIG, so I can distinguish your request from
> > > spam requests.
> >
> > Can we instead keep this all a part of the kernel docs instead of the
> > wiki? Couple of reasons for that:
>
> Ideally, we would put some material in the wiki, and also
> produce a document - some kind of "boot-time tuning guide" that can
> live in the kernel tree.

This is the part I care most about having in the kernel docs, e.g. which
configs to use, which command-line params to set, dos and don'ts for
drivers, etc. So it's good to see that this is an acceptable option.

> Some of the material that I think we will
> maintain will refer to boot sequences and operations outside the
> kernel (such as the bootloader or user-space), so the scope of
> the material to document is not just limited to the kernel.

Makes sense.

> Also, there will be a lot of material that will be system-specific.
> Historically, the kernel has avoided documenting things that are
> specific to an individual platform.

A lot of kernel params are arch-specific and we still document them in
the kernel, so I think there's some leeway. But yeah, it doesn't make
sense to document "improve RPi 4 boot time" kinds of stuff
in the kernel. I'm not sure that makes sense for the elinux.org wiki
either, but we can figure this out as we go.

> Finally, a lot of this information will be ad hoc, which also doesn't
> lend itself to upstreaming.

At least for the kernel params and configs, I don't think it's that
ad hoc. On the rest, I don't have an opinion.

> See my response to your individual points below.
>
> > - Since the instructions can be kernel version specific (as things
> > change), it makes sense to have the document synced with the kernel.
>
> That's a good point.  The current material suffers from not being synced
> very well with kernel versions.  That is, there is a lot of obsolete material.
> My own experience is that kernel documentation also has a bit of an issue
> with being kept up-to-date, but it's not as bad as wikis often get.
>
> It would be good to have some plans and possibly mechanisms to address
> the eventual obsolescence of the material.
>
> > - It's one less account to maintain and less chores for you.
> The cost per developer is one-time, which shouldn't be too bad
> for individual developers.  I already have the role of elinux administrator,
> and so I have to approve accounts anyway.  In either case
> (contributing to a wiki or contributing upstream), there is going to
> be some overhead for reviewing the material.
>
> > - One less business approval to get in terms of contributing to
> > external sources.
> This is an interesting point.  Does Google have rules regarding contributing
> to wikis?

In all the companies I've worked at, there's always been some level of
sanity check about external contributions to make sure you aren't
accidentally/intentionally signing up the company for something the
company doesn't agree with.

The point is that I don't know what Google's position is wrt
elinux.org and I need to find it out and/or go through any paperwork
that might be necessary to get approval. All that adds a lot of
inertia, at least for me. I know we are okay contributing to LKML, so
that makes kernel docs a frictionless process from an approval
perspective. And with the other points in mind about bit rot and
keeping things in sync with kernel, I'd prefer the kernel parts of it
being in the kernel docs.

> That is actually related to my plans to use automation to collect
> boot-time data.  My plan is to have tests that automatically send data
> to a central collection server (with data that is put into a shared, public database).
> I realize there will be some companies who won't want to share certain
> details of their in-development platforms.  When I publish the first
> script that does that (probably this week or next), we should discuss the
> ramifications of developers needing company consent for this.

My comment wasn't about this part at all. I don't think boot timing
needs to be run on any unreleased device. There are plenty of released
devices to test with. And we are actively adding Pixel 6 support to
upstream.

-Saravana

> > - Less chance of bit rot. As people make changes, the docs are right
> > there to go fix.
> You are right that bit rot is a significant risk with wikis, because there are
> no mechanisms to automatically update or remove obsolete material.
> I have some plans to fix that with some test instrumentation and upstream
> wiki processes that can automatically detect changes to published data,
> and can recommend review of material, or flag it as obsolete.
>
> My own experience is that it is significantly easier to change
> something on a wiki, than it is to change upstream kernel documentation.
> One requires just changing text using a web form, and the other requires
> an upstream-compatible e-mail based submit/review/approve/release cycle.
>
> I'm interested to learn more about the barriers that developers at Google (or other
> companies) might face in making contributions to a wiki.  Can you describe
> those obstacles in more detail?
>
> Thanks,
>  -- Tim
>
>
>


* RE: Boot-time initiative (SIG) thoughts and next steps
From: Bird, Tim @ 2024-10-28  1:29 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAGETcx_c2nfFQ++-FcsdUdLUo3e-oe07MkLgbuyrnq2FPrcsXQ@mail.gmail.com>



> -----Original Message-----
> From: Saravana Kannan <saravanak@google.com>
> On Fri, Oct 25, 2024 at 11:18 AM Bird, Tim <Tim.Bird@sony.com> wrote:
> >
> > Hey Linux developers,
> >
> > The response to my request to form a Special Interest Group for boot-time reduction
> > for Linux has been really great.  Many people contacted me by e-mail and on LinkedIn.
> 
> Hi Tim,
> 
> Thanks for organizing this and moving it forward! I'd be interested in
> contributing to this effort as a lot of work I have done aligns with
> the goals of this effort and boot time is of obvious value to Android.

Thanks for your interest.  I would love to have developers from Google, 
and from the Android community, involved.

> 
> > I had hoped to push out a script today to start to gather data on boot-time on different
> > platforms, for people to run who had expressed interest in helping with this effort. But
> > I got overwhelmed with other tasks, and I may not get it done today.  I'll be in Tokyo next
> > week for Open Source Summit Japan.  If you are there, please try to catch me and say hi.
> > Given that, I'll see how soon I can provide the script I'm talking about, and we can
> > discuss the goals and design of the script.
> >
> > A couple of quick things:
> > There are lots of things to discuss, but here are a few things to get started with...
> >
> > = wiki account =
> > The wiki where we'll be maintaining information about
> > boot time, and about activities of the boot time SIG, is the elinux wiki.
> > The page we'll be focusing on is: https://elinux.org/Boot_Time.
> > If you are interested in helping update and maintain the information there
> > (which I hope almost everyone is), then please make sure you have a user
> > account on the wiki.
> > If you don't have one, please go here:
> > https://elinux.org/Special:RequestAccount
> > I have to manually approve accounts in order to fight spambots.  It might
> > take a few days for me to get to your request.  It's very helpful if you
> > put a comment in one of the request fields about this being related to
> > the boot-time initiative or SIG, so I can distinguish your request from
> > spam requests.
> 
> Can we instead keep this all a part of the kernel docs instead of the
> wiki? Couple of reasons for that:

Ideally, we would put some material in the wiki, and also
produce a document - some kind of "boot-time tuning guide" that can
live in the kernel tree.    Some of the material that I think we will
maintain will refer to boot sequences and operations outside the
kernel (such as the bootloader or user-space), so the scope of 
the material to document is not just limited to the kernel.
Also, there will be a lot of material that will be system-specific.
Historically, the kernel has avoided documenting things that are
specific to an individual platform.
Finally, a lot of this information will be ad hoc, which also doesn't
lend itself to upstreaming.

See my response to your individual points below.

> - Since the instructions can be kernel version specific (as things
> change), it makes sense to have the document synced with the kernel.

That's a good point.  The current material suffers from not being synced
very well with kernel versions.  That is, there is a lot of obsolete material.
My own experience is that kernel documentation also has a bit of an issue
with being kept up-to-date, but it's not as bad as wikis often get. 

It would be good to have some plans and possibly mechanisms to address
the eventual obsolescence of the material.

> - It's one less account to maintain and less chores for you.
The cost per developer is one-time, which shouldn't be too bad
for individual developers.  I already have the role of elinux administrator,
and so I have to approve accounts anyway.  In either case
(contributing to a wiki or contributing upstream), there is going to
be some overhead for reviewing the material.

> - One less business approval to get in terms of contributing to
> external sources.
This is an interesting point.  Does Google have rules regarding contributing
to wikis?  That is actually related to my plans to use automation to collect
boot-time data.  My plan is to have tests that automatically send data
to a central collection server (with data that is put into a shared, public database).
I realize there will be some companies who won't want to share certain
details of their in-development platforms.  When I publish the first 
script that does that (probably this week or next), we should discuss the
ramifications of developers needing company consent for this.
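
As a rough illustration of the consent question, a reporting script could let each developer choose which fields leave their machine. The following is a minimal Python sketch under stated assumptions, not the script described above: the field names, `ALL_FIELDS`, and `build_boot_report` are all hypothetical, no schema has been published, and nothing is sent over the network because the collection server and its protocol are not specified here.

```python
import json
import platform

# Hypothetical field names; no schema has been published for the SIG's data.
ALL_FIELDS = ("board", "kernel_release", "machine", "boot_seconds")


def build_boot_report(boot_seconds: float, board: str,
                      share: tuple = ALL_FIELDS) -> str:
    """Serialize one measurement as JSON, keeping only the fields in `share`."""
    unknown = set(share) - set(ALL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    report = {
        "board": board,
        "kernel_release": platform.release(),
        "machine": platform.machine(),
        "boot_seconds": round(boot_seconds, 3),
    }
    # Drop every field the developer has not agreed to share.
    return json.dumps({k: v for k, v in report.items() if k in share},
                      sort_keys=True)


if __name__ == "__main__":
    # Share timing and architecture, but withhold the board name.
    print(build_boot_report(1.986, "example-board",
                            share=("machine", "boot_seconds")))
```

An allow-list (rather than a deny-list) means a field added to the report later is withheld by default until someone opts in.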

> - Less chance of bit rot. As people make changes, the docs are right
> there to go fix.
You are right that bit rot is a significant risk with wikis, because there are
no mechanisms to automatically update or remove obsolete material.
I have some plans to fix that with some test instrumentation and upstream
wiki processes that can automatically detect changes to published data,
and can recommend review of material, or flag it as obsolete.

My own experience is that it is significantly easier to change
something on a wiki, than it is to change upstream kernel documentation.
One requires just changing text using a web form, and the other requires
an upstream-compatible e-mail based submit/review/approve/release cycle.

I'm interested to learn more about the barriers that developers at Google (or other
companies) might face in making contributions to a wiki.  Can you describe
those obstacles in more detail?

Thanks,
 -- Tim





* Re: Boot-time initiative (SIG) thoughts and next steps
From: Rob Landley @ 2024-10-26 18:50 UTC (permalink / raw)
  To: Saravana Kannan, Bird, Tim
  Cc: linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAGETcx_c2nfFQ++-FcsdUdLUo3e-oe07MkLgbuyrnq2FPrcsXQ@mail.gmail.com>

On 10/26/24 02:36, Saravana Kannan wrote:
> On Fri, Oct 25, 2024 at 11:18 AM Bird, Tim <Tim.Bird@sony.com> wrote:
>>
>> Hey Linux developers,
>>
>> The response to my request to form a Special Interest Group for boot-time reduction
>> for Linux has been really great.  Many people contacted me by e-mail and on LinkedIn.
> 
> Hi Tim,
> 
> Thanks for organizing this and moving it forward! I'd be interested in
> contributing to this effort as a lot of work I have done aligns with
> the goals of this effort and boot time is of obvious value to Android.

I'm kind of an edge case for this project because my mkroot images at 
https://landley.net/bin/mkroot/latest mostly boot up in a couple 
seconds. (And faster if you feed in KARGS=quiet so the kernel boot 
messages don't take time emitting and scrolling before interrupts have 
been enabled. Although "quiet" doesn't seem to work in current vanilla 
kernels...?)
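
One way to confirm that `quiet` at least reached the kernel is to look at `/proc/cmdline` on the booted system. Below is a minimal Python sketch; `parse_cmdline` and `has_flag` are hypothetical helpers, and the parser is a simplification that splits on whitespace only, so it does not handle quoted values containing spaces. It only shows that the parameter was passed, not that the kernel honored it.

```python
from pathlib import Path


def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {name: value-or-None}.

    Simplification: splits on whitespace only, so quoted values that
    contain spaces are not handled.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        # Bare flags such as "quiet" have no "=" and map to None.
        params[name] = value if sep else None
    return params


def has_flag(text: str, flag: str) -> bool:
    """Return True if a parameter named `flag` is present."""
    return flag in parse_cmdline(text)


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    print("quiet present:", has_flag(cmdline, "quiet"))
```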

The ones that _don't_ are generally because qemu's bios for that 
platform twiddles its thumbs for a long time before launching the 
kernel, although there are some slow drivers in there:

$ for i in powerpc m68k i686 s390x; do (cd $i && echo $i && KARGS='quiet HANDOFF=echo' bash -c 'time ./run-qemu.sh > /dev/null'); done

powerpc
real	0m6.154s
user	0m3.689s
sys	0m0.341s

m68k
real	0m4.220s
user	0m1.142s
sys	0m0.212s

i686
real	0m1.986s
user	0m1.709s
sys	0m0.209s

s390x
real	0m1.644s
user	0m1.378s
sys	0m0.228s
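
For comparing runs, the `real` values above can be converted to seconds and sorted. This is a minimal Python sketch and not part of the mkroot tooling; `parse_bash_time`, `rank_by_boot_time`, and the `results` dictionary are hypothetical names, and the values are copied from the output above.

```python
import re

# Matches the "XmY.YYYs" format printed by bash's `time` keyword.
_TIME_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_bash_time(value: str) -> float:
    """Convert a bash `time` value such as '0m6.154s' to seconds."""
    match = _TIME_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# 'real' wall-clock values copied from the run above.
results = {
    "powerpc": "0m6.154s",
    "m68k": "0m4.220s",
    "i686": "0m1.986s",
    "s390x": "0m1.644s",
}


def rank_by_boot_time(raw: dict) -> list:
    """Return (arch, seconds) pairs sorted from fastest to slowest."""
    parsed = {arch: parse_bash_time(value) for arch, value in raw.items()}
    return sorted(parsed.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for arch, seconds in rank_by_boot_time(results):
        print(f"{arch:8s} {seconds:.3f}s")
```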

And that's with qemu running on a 10-year-old laptop that I'll have to 
switch off of when Debian drops x86-64-v2 support. (Even that i686 test 
isn't kvm.) It's running a recent-ish kernel (binaries I had lying 
around)...

# cat /proc/version
Linux version 6.11.0-rc7 (landley@driftwood) (s390x-linux-musl-gcc (GCC) 11.4.0, GNU ld (GNU Binutils) 2.33.1) #1 SMP Sat Sep 14 01:36:19 CDT 2024

Built using the kernel config files in the "doc" directory of those 
tarballs.
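
When timings from different kernels are compared, it helps to record the version in a structured form. The sketch below, in Python, pulls the version out of a `/proc/version` line like the one above; `kernel_version` is a hypothetical helper and the regular expression is an assumption that only covers the usual `Linux version X.Y[.Z][suffix]` prefix.

```python
import re

# Assumed prefix of /proc/version: "Linux version X.Y[.Z][suffix] ..."
_VERSION_RE = re.compile(r"^Linux version (\d+)\.(\d+)(?:\.(\d+))?(\S*)")


def kernel_version(proc_version: str) -> tuple:
    """Return (major, minor, patch, suffix) from a /proc/version string."""
    match = _VERSION_RE.match(proc_version)
    if match is None:
        raise ValueError("not a /proc/version string")
    major, minor, patch, suffix = match.groups()
    # A missing patch level is reported as 0.
    return int(major), int(minor), int(patch or 0), suffix


if __name__ == "__main__":
    with open("/proc/version") as handle:
        print(kernel_version(handle.read()))
```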

Are you trying to optimize the kernel boot, or more trying to optimize 
userspace? Because my userspace init is just a small shell script:

https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L102

And the above simple test loop just told that to run "echo" instead of 
/bin/sh so I could easily collect boot-and-exit timing for the qemu 
process...

>> I had hoped to push out a script today to start to gather data on boot-time on different
>> platforms, for people to run who had expressed interest in helping with this effort. But
>> I got overwhelmed with other tasks, and I may not get it done today.  I'll be in Tokyo next
>> week for Open Source Summit Japan.  If you are there, please try to catch me and say hi.
>> Given that, I'll see how soon I can provide the script I'm talking about, and we can
>> discuss the goals and design of the script.

I regression test under qemu because it gives reproducibly scriptable 
results. I've even got plumbing to run canned tests on multiple 
architectures in parallel (part of my release testing):

https://github.com/landley/toybox/blob/master/mkroot/testroot.sh

Rob


* Re: Boot-time initiative (SIG) thoughts and next steps
From: Saravana Kannan @ 2024-10-26  7:36 UTC (permalink / raw)
  To: Bird, Tim; +Cc: linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <MW5PR13MB5632321E93B031C0E107DB38FD4F2@MW5PR13MB5632.namprd13.prod.outlook.com>

On Fri, Oct 25, 2024 at 11:18 AM Bird, Tim <Tim.Bird@sony.com> wrote:
>
> Hey Linux developers,
>
> The response to my request to form a Special Interest Group for boot-time reduction
> for Linux has been really great.  Many people contacted me by e-mail and on LinkedIn.

Hi Tim,

Thanks for organizing this and moving it forward! I'd be interested in
contributing to this effort as a lot of work I have done aligns with
the goals of this effort and boot time is of obvious value to Android.

> I had hoped to push out a script today to start to gather data on boot-time on different
> platforms, for people to run who had expressed interest in helping with this effort. But
> I got overwhelmed with other tasks, and I may not get it done today.  I'll be in Tokyo next
> week for Open Source Summit Japan.  If you are there, please try to catch me and say hi.
> Given that, I'll see how soon I can provide the script I'm talking about, and we can
> discuss the goals and design of the script.
>
> A couple of quick things:
> There are lots of things to discuss, but here are a few things to get started with...
>
> = wiki account =
> The wiki where we'll be maintaining information about
> boot time, and about activities of the boot time SIG, is the elinux wiki.
> The page we'll be focusing on is: https://elinux.org/Boot_Time.
> If you are interested in helping update and maintain the information there
> (which I hope almost everyone is), then please make sure you have a user
> account on the wiki.
> If you don't have one, please go here:
> https://elinux.org/Special:RequestAccount
> I have to manually approve accounts in order to fight spambots.  It might
> take a few days for me to get to your request.  It's very helpful if you
> put a comment in one of the request fields about this being related to
> the boot-time initiative or SIG, so I can distinguish your request from
> spam requests.

Can we keep all of this in the kernel docs instead of the
wiki? A couple of reasons for that:
- Since the instructions can be kernel version specific (as things
change), it makes sense to have the document synced with the kernel.
- It's one less account to maintain and fewer chores for you.
- One less business approval to get in terms of contributing to
external sources.
- Less chance of bit rot. As people make changes, the docs are right
there to go fix.

Thanks,
Saravana

> = support for new developers =
> A number of developers have asked me if they can participate and contribute,
> even if they are not seasoned Linux kernel developers.  The answer is "Yes"!
> I hope to provide a range of activities for people to provide data, help update
> the wiki, implement and run tests and perform research - even if they don't have
> any previous Linux development experience.  I hope it will be fun to participate,
> and very educational.
>
> If you are new to Linux and have just joined this group, please review some
> of the material on the Boot_Time page mentioned above.  We will be covering
> more than just the kernel in the project, but one place to get started will be
> to look at the kernel source file init/main.c, particularly the function start_kernel()
> (which is where a lot of the "magic" happens at kernel startup time.)
> Don't be afraid to ask questions.  Please ask them on this list so that others benefit
> from any answers provided.
>
> = short-term plans =
> I am building out the "membership" of the SIG over the very short term.  I have
> some more individuals and companies to contact to see who wants to be involved.
>
> Other things I'd like to do are:
>  * start gathering boot timing data for different systems (using the script I described above)
>  * start pruning obsolete information and refactoring the boot-time material on the elinux wiki
>     * (Yes - some of the material there is quite dated, so be sure to check it out before you try to
>        use some tool or technique - if something doesn't work, please send an e-mail or mark it in the wiki)
>  * discuss planning for SIG video conference calls and meetings
>      * I know I'm interested in having a boot-time micro-conference at Embedded Linux
>      Conference next year - but we need to discuss if we want regular calls or other face-to-face
>       meetings
>  * perform a survey of existing boot-time reduction techniques, and see where they are
>     in the pipeline of upstreaming or deployment in actual products
>  * finally (for this list), brainstorm what activities the SIG should do, and how we can
>     collaborate on those.  I've started a list at: https://elinux.org/Boot_Time_Project_Ideas
>     that you can look at and comment on (either on this list, or on the wiki).
>
> I'll be busy with business travel and Sony work next week, but I hope I still
> find some time to follow up on this.  I look forward to working with many of you
> reading this, on improving this area of Linux.
>  -- Tim
>
>


* Boot-time initiative (SIG) thoughts and next steps
From: Bird, Tim @ 2024-10-25 18:17 UTC (permalink / raw)
  To: linux-embedded@vger.kernel.org; +Cc: linux-kernel@vger.kernel.org

Hey Linux developers,

The response to my request to form a Special Interest Group for boot-time reduction
for Linux has been really great.  Many people contacted me by e-mail and on LinkedIn.

I had hoped to push out a script today to start to gather data on boot-time on different
platforms, for people to run who had expressed interest in helping with this effort. But
I got overwhelmed with other tasks, and I may not get it done today.  I'll be in Tokyo next
week for Open Source Summit Japan.  If you are there, please try to catch me and say hi.
Given that, I'll see how soon I can provide the script I'm talking about, and we can
discuss the goals and design of the script.

A couple of quick things:
There are lots of things to discuss, but here are a few things to get started with...

= wiki account =
The wiki where we'll be maintaining information about 
boot time, and about activities of the boot time SIG, is the elinux wiki.
The page we'll be focusing on is: https://elinux.org/Boot_Time.
If you are interested in helping update and maintain the information there
(which I hope almost everyone is), then please make sure you have a user
account on the wiki.
If you don't have one, please go here:
https://elinux.org/Special:RequestAccount
I have to manually approve accounts in order to fight spambots.  It might
take a few days for me to get to your request.  It's very helpful if you
put a comment in one of the request fields about this being related to
the boot-time initiative or SIG, so I can distinguish your request from
spam requests.

= support for new developers =
A number of developers have asked me if they can participate and contribute,
even if they are not seasoned Linux kernel developers.  The answer is "Yes"!
I hope to provide a range of activities for people to provide data, help update
the wiki, implement and run tests and perform research - even if they don't have
any previous Linux development experience.  I hope it will be fun to participate,
and very educational.

If you are new to Linux and have just joined this group, please review some
of the material on the Boot_Time page mentioned above.  We will be covering
more than just the kernel in the project, but one place to get started will be
to look at the kernel source file init/main.c, particularly the function start_kernel()
(which is where a lot of the "magic" happens at kernel startup time.)
Don't be afraid to ask questions.  Please ask them on this list so that others benefit
from any answers provided.
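
One concrete way to see what happens during kernel startup is to boot with the kernel's `initcall_debug` parameter, which logs how long each initcall took, and then rank the results. This suggestion and the Python sketch below are an addition for illustration, not something described in this mail; `slowest_initcalls` is a hypothetical helper, and the regular expression assumes the usual `initcall <function> returned <code> after <N> usecs` log line.

```python
import re
import sys

# Line format logged when booting with initcall_debug, e.g.
# "[    0.512345] initcall foo_init+0x0/0x40 returned 0 after 1234 usecs"
_INITCALL_RE = re.compile(
    r"initcall (?P<name>\S+?)(?:\+0x[0-9a-f]+/0x[0-9a-f]+)?"
    r"(?: \[[\w-]+\])? returned (?P<ret>-?\d+) after (?P<usecs>\d+) usecs"
)


def slowest_initcalls(log_text: str, top: int = 5) -> list:
    """Return up to `top` (function, usecs) pairs, slowest first."""
    timings = []
    for line in log_text.splitlines():
        match = _INITCALL_RE.search(line)
        if match:
            timings.append((match.group("name"), int(match.group("usecs"))))
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Reads a kernel log from standard input, for example piped from dmesg.
    for name, usecs in slowest_initcalls(sys.stdin.read()):
        print(f"{usecs:>10d} usecs  {name}")
```

The functions that show up at the top of such a list are a reasonable starting point for reading the corresponding driver or subsystem init code.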

= short-term plans =
I am building out the "membership" of the SIG over the very short term.  I have
some more individuals and companies to contact to see who wants to be involved.

Other things I'd like to do are:
 * start gathering boot timing data for different systems (using the script I described above)
 * start pruning obsolete information and refactoring the boot-time material on the elinux wiki
    * (Yes - some of the material there is quite dated, so be sure to check it out before you try to
       use some tool or technique - if something doesn't work, please send an e-mail or mark it in the wiki)
 * discuss planning for SIG video conference calls and meetings
     * I know I'm interested in having a boot-time micro-conference at Embedded Linux
     Conference next year - but we need to discuss if we want regular calls or other face-to-face
      meetings
 * perform a survey of existing boot-time reduction techniques, and see where they are
    in the pipeline of upstreaming or deployment in actual products
 * finally (for this list), brainstorm what activities the SIG should do, and how we can
    collaborate on those.  I've started a list at: https://elinux.org/Boot_Time_Project_Ideas
    that you can look at and comment on (either on this list, or on the wiki).

I'll be busy with business travel and Sony work next week, but I hope I still
find some time to follow up on this.  I look forward to working with many of you
reading this, on improving this area of Linux.
 -- Tim



* Boot-time presentations
From: Bird, Tim @ 2024-10-23 23:43 UTC (permalink / raw)
  To: linux-embedded@vger.kernel.org

Hello kernel developers,

I'm writing to inform you of some community-building initiatives I have planned
for embedded people interested in, or working on, boot time reduction for the kernel.

More specifically, I'm working on updating online resources on this topic, and
re-starting the use of some existing communication channels.

First thing: I've collected presentations and videos for the past several years on the topic
of boot time, on a page on the elinux wiki.
See this page: https://elinux.org/Boot_Time_Presentations

I plan to re-work and update the information on the boot time page on the elinux
wiki in the next few months. That page is here: https://elinux.org/Boot_Time

It needs a fair amount of work to update it for recent kernels and capture current areas of
activity.   But even some of the old information there is useful now.

Second thing: I'm also trying to collect a list of developers who are actively working in this area,
as well as active areas of instrumentation, testing, patches, and techniques.
If you are interested in engaging in discussions about Linux boot time, the main thing to do
is to subscribe to the linux-embedded@vger.kernel.org list.  If you're reading this in an e-mail
client, instead of on lore, then you're all set.  If not, you should consider subscribing to that
list.  This linux-embedded mailing list is where I'll be announcing more about my work, and
trying to organize some collaboration in this space. You can also e-mail me privately if you
want to get on my list of "interested parties".

I'll be trying to set up some meetings and collaborative activities in the next few months.
I have a few patches I will be sending (such as my patch for 'deferred initcalls'), as well
as trying to build on some testing work that Collabora has already started (and releasing
my own tests of boot time, suitable for CI integration).

If you're in Japan, I'll be at Open Source Summit Japan in Tokyo next week.  And I'd love
to meet up with you to discuss current work and future plans.

This is a follow-up from discussions held at Linux Plumbers Conference this year.
I hope to talk to you soon.

 -- Tim Bird, Principal Software Engineer, Sony Electronics

^ permalink raw reply

* Request to join the list
From: Weyman Lo @ 2024-10-23 20:59 UTC (permalink / raw)
  To: linux-embedded


-- 
Weyman Lo
Codethink Ltd.
www.codethink.co.uk
https://www.linkedin.com/in/weymanlo/
mobile: +44 7810 530 880


^ permalink raw reply

* Join
From: Stephen Aaskov @ 2024-10-23 20:14 UTC (permalink / raw)
  To: linux-embedded


Sent from my iPhone

^ permalink raw reply

* Linux -Embedded
From: Muahmmad Salman @ 2024-10-23  6:13 UTC (permalink / raw)
  To: linux-embedded


Sent from my iPhone

^ permalink raw reply

* Wish to join this group. Please add me
From: priyaranjan @ 2024-10-23  5:24 UTC (permalink / raw)
  To: linux-embedded
In-Reply-To: <CAE_iR+jYEzf6M2vPTsXJEyAMJ+21uwHJL=c4U3S1cjF72gjjzQ@mail.gmail.com>



^ permalink raw reply

* I would like to join
From: Srikanth Valla @ 2024-10-23  4:45 UTC (permalink / raw)
  To: linux-embedded



Sent from my iPhone

^ permalink raw reply

* Yamaha Piano 10/12
From: Josey Swihart @ 2024-10-12 21:58 UTC (permalink / raw)
  To: linux-embedded

Hello,

I'm offering my late husband's Yamaha piano to anyone who would truly appreciate it. If you or someone you know would be interested in receiving this instrument for free, please don't hesitate to contact me.

Warm regards,
Josey

^ permalink raw reply

* [kernel 6.10.10][aarch64] PCIe Bridge - NVMe SSD - No SMMU (or IOMMU)
From: Lior Weintraub @ 2024-09-16 11:22 UTC (permalink / raw)
  To: linux-embedded@vger.kernel.org

Dear friends, 

I am running Linux kernel 6.10.10 on a Cortex A53 single CPU running @ 1.3GHz.
The CPU is part of a SoC with a PCIe bridge (root port) from Synopsys (using compatible = "snps,dw-pcie").
A Gen4 SSD is connected to the PCIe RP so when I run lspci I see both the PCIe bridge and the SSD:
# lspci
00:00.0 PCI bridge: Device 1e7e:abcd (rev 01)
01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5040 (rev 03)

The problem I am facing is that the NVMe driver fails to load:
[    0.862737][   T10] nvme nvme0: 1/0/0 default/read/poll queues
[    0.874457][    C0] could not locate request for tag 0xfff
[    0.879977][    C0] nvme nvme0: invalid id 65535 completed on queue 1
[   31.820058][    T8] nvme nvme0: I/O tag 128 (0080) opcode 0x2 (I/O Cmd) QID 1 timeout, aborting req_op:READ(0) size:4096
[   31.831882][    C0] nvme nvme0: Abort status: 0x0
[   62.540052][    T8] nvme nvme0: I/O tag 128 (0080) opcode 0x2 (I/O Cmd) QID 1 timeout, reset controller
[   62.596074][   T20] nvme nvme0: 1/0/0 default/read/poll queues
[   62.602059][    C0] could not locate request for tag 0xfff
[   62.607567][    C0] nvme nvme0: invalid id 65535 completed on queue 1
[   93.260066][    T8] nvme nvme0: I/O tag 129 (0081) QID 1 timeout, disable controller
[   93.274391][    T8] I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   93.283627][    T8] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[   93.291676][   T20] nvme nvme0: failed to mark controller live state
[   93.298111][   T20] nvme nvme0: Disabling device after reset failure: -19
[   93.305094][   T20] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[   93.313101][   T27]  nvme0n1: unable to read partition table

I found in drivers/nvme/host/pci.c that the address written to the device is the DMA address, which is different from the physical address.
I added some prints to nvme_pci_configure_admin_queue:
[    0.825383][   T10] nvme nvme0: ----> sq_dma_addr 0xa36000 cq_dma_addr 0xa35000
[    0.832767][   T10] nvme nvme0: ----> phys sq = 0x8afd000 phys cq = 0x8af5000

After reading the DMA API howto page (https://docs.kernel.org/core-api/dma-api-howto.html), I understand the role of the bus address space and the IOMMU (or SMMU).
My SoC doesn't have an SMMU, and the mapping between device (bus) addresses and physical addresses is 1:1.

So the question is:
How can I disable this address translation so that the kernel allocates DMA addresses that match physical addresses?

I tried the following (each one separately) and none worked:
1. Pass the boot argument "iommu.passthrough=1".
2. Remove "IOMMU Hardware Support" from the kernel configuration.
3. Enable "IOMMU Hardware Support" but set CONFIG_IOMMU_DEFAULT_PASSTHROUGH=y.

Thanks,
Lior.
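
A quick arithmetic check on the two address pairs printed above may help frame the question. This is only a minimal sketch in plain C: the helper names are made up for illustration and nothing here is kernel API. It subtracts the printed DMA address from the printed physical address for each queue and compares the results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the debug prints in the message above. */
#define SQ_DMA_ADDR  UINT64_C(0x00a36000)
#define CQ_DMA_ADDR  UINT64_C(0x00a35000)
#define SQ_PHYS_ADDR UINT64_C(0x08afd000)
#define CQ_PHYS_ADDR UINT64_C(0x08af5000)

/* Hypothetical helper: offset between a physical and a DMA address. */
static uint64_t phys_minus_dma(uint64_t phys, uint64_t dma)
{
    assert(phys >= dma); /* holds for both printed pairs */
    return phys - dma;
}

/* True if one fixed offset would explain both queue mappings. */
static bool same_offset_for_both_queues(void)
{
    return phys_minus_dma(SQ_PHYS_ADDR, SQ_DMA_ADDR) ==
           phys_minus_dma(CQ_PHYS_ADDR, CQ_DMA_ADDR);
}
```

With these values the differences are 0x80c7000 for the submission queue and 0x80c0000 for the completion queue. Assuming the printed physical addresses are accurate, a single fixed bus offset would not explain both mappings.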

^ permalink raw reply

* lowend platforms support such as Cortex-M
From: Andy Gao @ 2024-05-17 10:46 UTC (permalink / raw)
  To: linux-embedded

Hi,
I'm curious about the level of support that modern Linux offers for
low-end platforms that lack MMU components, such as the ARM Cortex-M
family. Specifically, I'm interested in the memory protection
capabilities that Linux provides on these platforms.
While the ARM Cortex-M family offers an MPU (memory protection unit),
I haven't seen many features related to it in the Linux kernel.
Please let me know if I'm mistaken, and any additional information
would be greatly appreciated. Thank you!

Best Regards,
Andy

^ permalink raw reply

* Re: Linux Kernel (megi patches for PinePhone Pro)
From: Tanvir Roshid @ 2024-03-12 16:34 UTC (permalink / raw)
  To: linux-embedded
In-Reply-To: <a5b7e74f-fad6-4a07-a2cc-614cdb75046e@codethink.co.uk>


On 12/03/2024 16:30, Tanvir Roshid wrote:
> Hi,
>
> I hope you are well.
>
> I wanted to post this message to discuss the megous kernel and 
> communicate with the embedded Linux community. This post is my first 
> attempt at using the Linux mailing list, so forgive me if I make any 
> mistakes.
>
> For context, the megous kernel is a fork of the Torvalds kernel 
> containing patches to enable the PinePhone and PinePhone Pro to boot 
> correctly.
>
> The megous kernel disappeared earlier this year. We have spent the 
> better part of the year getting the phones to boot with the upstream 
> kernel for GNOME OS. We successfully confirmed working boards using 
> patches found on this repo:
> - 
> https://gitlab.com/pine64-org/linux/-/tree/linux-pinephonepro-6.6.y?ref_type=heads
>
> The work is visible here:
> - https://gitlab.gnome.org/GNOME/gnome-build-meta/-/merge_requests/2455
>
> I am aware that a new fork replacing the megous kernel exists here:
> - https://github.com/sailfish-on-dontbeevil/kernel-megi
>
> The GNOME community would prefer not to rely on a custom kernel and 
> use the upstream version to avoid a repeat of the megous kernel and 
> its disappearance. Recently, the patches have understandably failed to 
> apply to the new kernel. We would prefer not to upstream these patches 
> for long-term maintainability versus continuous maintenance.
Apologies; my sentence was not clear here. I mean to state "We would 
prefer to upstream these patches for long-term maintainability versus 
continuous maintenance. "
>
> My question to the embedded community is:
> - What is preventing the upstream kernel from integrating these patches?
>
> From research (https://news.ycombinator.com/item?id=30015412), I can 
> see that these patches present problems. However, we would like to 
> know more specifics to eventually upstream these patches via 
> additional work.
>
> Kind regards,
> Tanvir Roshid
>
>

^ permalink raw reply

* Linux Kernel (megi patches for PinePhone Pro)
From: Tanvir Roshid @ 2024-03-12 16:30 UTC (permalink / raw)
  To: linux-embedded

Hi,

I hope you are well.

I wanted to post this message to discuss the megous kernel and 
communicate with the embedded Linux community. This post is my first 
attempt at using the Linux mailing list, so forgive me if I make any 
mistakes.

For context, the megous kernel is a fork of the Torvalds kernel 
containing patches to enable the PinePhone and PinePhone Pro to boot 
correctly.

The megous kernel disappeared earlier this year. We have spent the 
better part of the year getting the phones to boot with the upstream 
kernel for GNOME OS. We successfully confirmed working boards using 
patches found on this repo:
- 
https://gitlab.com/pine64-org/linux/-/tree/linux-pinephonepro-6.6.y?ref_type=heads

The work is visible here:
- https://gitlab.gnome.org/GNOME/gnome-build-meta/-/merge_requests/2455

I am aware that a new fork replacing the megous kernel exists here:
- https://github.com/sailfish-on-dontbeevil/kernel-megi

The GNOME community would prefer not to rely on a custom kernel and use 
the upstream version to avoid a repeat of the megous kernel and its 
disappearance. Recently, the patches have understandably failed to apply 
to the new kernel. We would prefer not to upstream these patches for 
long-term maintainability versus continuous maintenance.

My question to the embedded community is:
- What is preventing the upstream kernel from integrating these patches?

From research (https://news.ycombinator.com/item?id=30015412), I can 
see that these patches present problems. However, we would like to know 
more specifics to eventually upstream these patches via additional work.

Kind regards,
Tanvir Roshid

^ permalink raw reply

* Yamaha grand piano 02/22/2024
From: Paula Mortalo @ 2024-02-22 19:35 UTC (permalink / raw)
  To: linux-embedded

Hello,

I'm offering my late husband's Yamaha Piano to any music enthusiast. If you or someone you know might value this instrument, please don't hesitate to reach out to me.

Warm regards,
Paula

^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-26  7:48 UTC (permalink / raw)
  To: hs@denx.de, Dirk Behme; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB055596BEB89D5CF1D515E738C39AA@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Update:
Issue with CPU idle was found.
It was related to our SoC changes in timers interrupt connectivity (which makes sense :-)).
Merry Xmas, all.

> -----Original Message-----
> From: Lior Weintraub
> Sent: Sunday, December 24, 2023 9:12 PM
> To: hs@denx.de; Dirk Behme <dirk.behme@gmail.com>
> Cc: linux-embedded@vger.kernel.org
> Subject: RE: Debugging early SError exception
> 
> Update:
> UART issue ("unable to open an initial console") was resolved.
> I was missing CONFIG_SERIAL_8250_DW=y on my config.
> 
> Now the only issue left is that the CPU sits idle in "wfi" and no interrupts arrive.
> 
> > -----Original Message-----
> > From: Lior Weintraub
> > Sent: Sunday, December 24, 2023 5:42 PM
> > To: hs@denx.de; Dirk Behme <dirk.behme@gmail.com>
> > Cc: linux-embedded@vger.kernel.org
> > Subject: RE: Debugging early SError exception
> >
> > Hi,
> >
> > The GICv3 issue was resolved after:
> > > 1. Setting bit 0 and bit 3 in ICC_SRE_EL3 (we don't have virtualization
> > > support, and hence ICC_SRE_EL2 is not supported).
> > > 2. Powering up the GICR at EL3.
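
For reference, the two ICC_SRE_EL3 bits named in step 1 can be written down as masks. This is a minimal sketch, not the firmware code used on this SoC: the macro and function names are made up for illustration, and the actual register write (an msr to ICC_SRE_EL3 from EL3 firmware) is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as described in step 1 above (names are illustrative). */
#define ICC_SRE_SRE    (UINT64_C(1) << 0)  /* bit 0: system register interface enable */
#define ICC_SRE_ENABLE (UINT64_C(1) << 3)  /* bit 3: lower ELs may access their ICC_SRE registers */

/*
 * Return 'current' with SRE (bit 0) and Enable (bit 3) set,
 * leaving all other bits unchanged.
 */
static uint64_t icc_sre_set_sre_and_enable(uint64_t current)
{
    uint64_t value = current | ICC_SRE_SRE | ICC_SRE_ENABLE;

    /* Postcondition: both requested bits are set in the result. */
    assert((value & (ICC_SRE_SRE | ICC_SRE_ENABLE)) ==
           (ICC_SRE_SRE | ICC_SRE_ENABLE));
    return value;
}
```

Starting from a register value of 0, this gives 0x9.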
> >
> > The earlycon issue was resolved after:
> > > 1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot
> > > args.
> > > 2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously it had
> > > only CONFIG_SERIAL_8250=y).
> >
> > Now I face a new issue:
> > Linux boot hangs on "wait for interrupt" at cpu_do_idle.
> >
> > The program counter is stuck at 0xffff8000805ae45c.
> > ffff8000805ae454 <cpu_do_idle>:
> > ffff8000805ae454:       d5033f9f        dsb     sy
> > ffff8000805ae458:       d503207f        wfi
> > ffff8000805ae45c:       d65f03c0        ret
> >
> > I think that something is wrong with the timers or gic setting and as a result
> > the scheduler doesn't get the interrupts (timer ticks).
> >
> > Additional info that might be relevant to this issue:
> > The emulation platform runs at about 2.8MHz.
> > > CNTFRQ_EL0 is set to 2M (because the emulation platform's running frequency
> > > varies between 1.9 and 2.8MHz).
> > > The reason for those settings is to allow Linux to run as it would in the
> > > "real" world.
> >
> > It is my understanding that there are 2 issues here:
> > > 1. Something is wrong with the timer/interrupt settings (note that the same
> > > configuration runs correctly on QEMU).
> > > 2. Something is wrong with the initramfs: according to the kernel source, it
> > > seems to fail to open "/dev/console".
> >
> > The full Linux boot log:
> > Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> > gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU
> > Binuti) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023
> > Machine model: Pliops Spider MK-I EVK
> > > earlycon: uart8250 at MMIO32 0x000000d000307000 (options '115200n8')
> > printk: bootconsole [uart8250] enabled
> > efi: UEFI not found.
> > Zone ranges:
> >   DMA      [mem 0x0000000000000000-0x000000002fffffff]
> >   DMA32    empty
> >   Normal   empty
> > Movable zone start for each node
> > Early memory node ranges
> >   node   0: [mem 0x0000000000000000-0x000000002fffffff]
> > Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> > percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> > Detected VIPT I-cache on CPU0
> > CPU features: detected: GIC system register CPU interface
> > CPU features: detected: ARM erratum 845719
> > alternatives: applying boot alternatives
> > Kernel command line: console=ttyS0,115200n8
> > earlycon=uart8250,mmio32,0xd000307000,115200n8
> > Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> > Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> > Built 1 zonelists, mobility grouping on.  Total pages: 193536
> > mem auto-init: stack:off, heap alloc:off, heap free:off
> > software IO TLB: area num 1.
> > software IO TLB: mapped [mem 0x000000002b080000-
> > 0x000000002f080000] (64MB)
> > Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata,
> > 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> > SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> > trace event string verifier disabled
> > rcu: Hierarchical RCU implementation.
> > rcu:    RCU event tracing is enabled.
> > rcu:    RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> > rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> > rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> > NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> > GICv3: 96 SPIs implemented
> > GICv3: 0 Extended SPIs implemented
> > Root IRQ handler: gic_handle_irq
> > GICv3: GICv3 features: 16 PPIs
> > GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> > ITS [mem 0xe000040000-0xe00005ffff]
> > > ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz 8,
> > > psz 64K, shr 1)
> > ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000
> > (flat, esz 2, psz 64K, shr 1)
> > GICv3: Expected reserved range
> > [0x00000000000c0000:0x00000000000cffff], not found
> > GICv3: using LPI property table @0x00000000000c0000
> > GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted
> > CPU0: Failed to disable LPIs
> > rcu: srcu_init: Setting srcu_struct sizes based on contention.
> > arch_timer: cp15 timer(s) running at 62.50MHz (virt).
> > clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles:
> > 0x1cd42e208c, max_idle_ns: 881590405314 ns
> > sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every
> > 4398046511096ns
> > Console: colour dummy device 80x25
> > Calibrating delay loop (skipped), value calculated using timer frequency..
> > 125.00 BogoMIPS (lpj=250000)
> > pid_max: default: 32768 minimum: 301
> > Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> > Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> > cacheinfo: Unable to detect cache hierarchy for CPU 0
> > rcu: Hierarchical SRCU implementation.
> > rcu:    Max phase no-delay instances is 1000.
> > Platform MSI: gic-its@E000040000 domain created
> > PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000
> > domain created
> > EFI services will not be available.
> > smp: Bringing up secondary CPUs ...
> > smp: Brought up 1 node, 1 CPU
> > SMP: Total of 1 processors activated.
> > CPU features: detected: 32-bit EL0 Support
> > CPU features: detected: CRC32 instructions
> > CPU: All CPU(s) started at EL1
> > alternatives: applying system-wide alternatives
> > devtmpfs: initialized
> > clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns:
> > 7645041785100000 ns
> > futex hash table entries: 256 (order: 2, 16384 bytes, linear)
> > DMI not present or invalid.
> > DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic
> > allocations
> > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic
> > allocations
> > hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
> > ASID allocator initialised with 65536 entries
> > Serial: AMBA PL011 UART driver
> > Modules: 30080 pages in range for non-PLT usage
> > Modules: 521600 pages in range for PLT usage
> > iommu: Default domain type: Translated
> > iommu: DMA domain TLB invalidation policy: strict mode
> > SCSI subsystem initialized
> > vgaarb: loaded
> > clocksource: Switched to clocksource arch_sys_counter
> > PCI: CLS 0 bytes, default 64
> > workingset: timestamp_bits=46 max_order=18 bucket_order=0
> > fuse: init (API version 7.38)
> > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> > io scheduler mq-deadline registered
> > io scheduler kyber registered
> > Unpacking initramfs...
> > Freeing initrd memory: 4596K
> > Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
> > hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters
> > available
> > clk: Disabling unused clocks
> > Warning: unable to open an initial console.
> > Freeing unused kernel memory: 1600K
> >
> > Thanks in advance for your great advice and support,
> > Cheers,
> > Lior.
> >
> > > -----Original Message-----
> > > From: Heiko Schocher <hs@denx.de>
> > > Sent: Friday, December 22, 2023 10:04 AM
> > > To: Dirk Behme <dirk.behme@gmail.com>; Lior Weintraub
> > > <liorw@pliops.com>
> > > Cc: linux-embedded@vger.kernel.org
> > > Subject: Re: Debugging early SError exception
> > >
> > > Hello Dirk, Lior,
> > >
> > > On 22.12.23 08:48, Dirk Behme wrote:
> > > > > On 22.12.23 at 08:03, Lior Weintraub wrote:
> > > >> Hi,
> > > >>
> > > > >> I managed to dump the __log_buf, but for some reason the UART is still
> > > > >> not working.
> > > > >> Please note that the UART printed all the U-Boot traces, so AFAIU the
> > > > >> device tree is set correctly.
> > > > >> (Barebox is passing its DTB to the kernel.)
> > > >>
> > > >> To enable the earlyprintk I have:
> > > > >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> > > > >> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
> > > > >> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
> > > > >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
> > > >>
> > > >>  From __log_buf dump:
> > > >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4]
> > > >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> > > gcc.br_real (Buildroot
> > > >> 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107
> > > SMP Thu Dec 21 17:33:12 IST 202323
> > > >> Machine model: Pliops Spider MK-I EVKVK
> > > >> efi: UEFI not found.d.
> > > >> Zone ranges:s:
> > > >>    DMA      [mem 0x0000000000000000-0x000000002fffffff]f]
> > > >>    DMA32    emptyty
> > > >>    Normal   emptyty
> > > >> Movable zone start for each nodede
> > > >> Early memory node rangeses
> > > >>    node   0: [mem 0x0000000000000000-0x000000002fffffff]f]
> > > >> Initmem setup node 0 [mem 0x0000000000000000-
> > > 0x000000002fffffff]f]
> > > >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000
> > > >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> > > >> pcpu-alloc: [0] 0
> > > >> Detected VIPT I-cache on CPU0U0
> > > >> CPU features: GIC system register CPU interface present but disabled by
> > > higher exception levelel
> > > >> CPU features: detected: ARM erratum 84571919
> > > >> alternatives: applying boot alternativeses
> > > >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-
> > > uart,0xd00030700000
> > > >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes,
> > linear)r)
> > > >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r)
> > > >> Built 1 zonelists, mobility grouping on.  Total pages: 19353636
> > > >> mem auto-init: stack:off, heap alloc:off, heap free:offff
> > > >> software IO TLB: area num 1.1.
> > > >> software IO TLB: mapped [mem 0x000000002b080000-
> > > 0x000000002f080000] (64MB)B)
> > > >> Memory: 689240K/786432K available (5824K kernel code, 1186K
> > rwdata,
> > > 1612K rodata, 1600K init, 400K
> > > >> bss, 97192K reserved, 0K cma-reserved)d)
> > > >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1
> > > >> trace event string verifier disableded
> > > >> rcu: Hierarchical RCU implementation.n.
> > > >> rcu:     RCU event tracing is enabled.d.
> > > >> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1.
> > > >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s.
> > > >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1
> > > >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0
> > > >> GICv3: 96 SPIs implementeded
> > > >> GICv3: 0 Extended SPIs implementeded
> > > >> Root IRQ handler: gic_handle_irqrq
> > > >> GICv3: GICv3 features: 16 PPIsIs
> > > >> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000
> > > >> GICv3: redistributor failed to wakeup.....
> > > >> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad
> > > >
> > > > I think the two messages above are the essential ones.
> > >
> > > +1
> > >
> > > > Maybe it helps to check
> > > >
> > > > > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> > > >
> > > > > In the middle of that page, the "Call the kernel image" section has
> > > > > something about the GIC:
> > > >
> > > > -- cut --
> > > > If the kernel is entered at EL1:
> > > >
> > > > >         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
> > > >         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > > > -- cut --
> > >
> > > > Also, maybe it makes sense to check your firmware (bootloader, ATF?) ... maybe
> > > > there is some setting missing for your SoC/board?
> > >
> > > bye,
> > > Heiko
> > >
> > > >
> > > >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1]
> > > SMPMP
> > > >> Modules linked in:
> > > >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> > > >> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > >> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> > > >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> > > >> sp : ffff8000808f3b40
> > > >> x29: ffff8000808f3b40 x28: 0000000000000000 x27:
> > > 0000000000000001
> > > >> x26: ffff000000016040 x25: 0000000000000000 x24:
> > ffff800080a6b000
> > > >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> > > >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> > > >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> > > >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> > > >> x11: 452074612064656c x10: 6261736964282045 x9 :
> > > 6428204552532074
> > > >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> > > >> x5 : 000000000000002a x4 : 0000000000000000 x3 :
> > > 0000000000000000
> > > >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 :
> 000000000000003c
> > > >> Call trace:
> > > >>   gic_cpu_sys_reg_init+0x58/0x2e4
> > > >>   gic_cpu_init.part.0+0xa8/0x114
> > > >>   gic_init_bases+0x408/0x684
> > > >>   gic_of_init+0x298/0x300
> > > >>   of_irq_init+0x1c8/0x368
> > > >>   irqchip_init+0x14/0x1c
> > > >>   init_IRQ+0x98/0xac
> > > >>   start_kernel+0x250/0x5b8
> > > >>   __primary_switched+0xb4/0xbc
> > > >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) )
> > > >> ---[ end trace 0000000000000000 ]-----
> > > >> Kernel panic - not syncing: Attempted to kill the idle task!k!
> > > >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]-----
> > > >>
> > > >>
> > > > >> The kernel panic is related to the GIC distributor (currently under debug),
> > > > >> but AFAIU this has nothing to do with the UART not working in the early
> > > > >> stages.
> > > >
> > > >
> > > > > Yes, I agree. The GIC issue and the UART (at least the polling mode)
> > > > > should be independent.
> > > >
> > > > Best regards
> > > >
> > > > Dirk
> > > >
> > > >
> > > > >> Thanks in advance for your advice,
> > > >> Cheers,
> > > >> Lior.
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Heiko Schocher <hs@denx.de>
> > > >>> Sent: Thursday, December 21, 2023 1:37 PM
> > > >>> To: Lior Weintraub <liorw@pliops.com>
> > > >>> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-
> > > embedded@vger.kernel.org
> > > >>> Subject: Re: Debugging early SError exception
> > > >>>
> > > >>> Hi Lior,
> > > >>>
> > > >>> On 21.12.23 12:19, Dirk Behme wrote:
> > > > >>>> On 21.12.23 at 11:04, Lior Weintraub wrote:
> > > >>>>> Thanks Dirk,
> > > >>>>>
> > > >>>>> Regarding the earlyprintk, not sure I know how to make it work.
> > > > >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y in my
> > > > >>>>> config but it doesn't seem to work.
> > > > >>>>> Do I need to pass something in the bootargs from U-Boot?
> > > > >>>>> Do I need to add that to my device tree?
> > > > >>>>> (I tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under
> > > > >>>>> "chosen" in my DT, but it didn't work.)
> > > >>>>
> > > > >>>> Yes, what has to be enabled, what not, and how it has to be set is
> > > > >>>> often confusing. I think this is not common to all systems, so to be
> > > > >>>> on the safe side you have to look into the code for your system.
> > > > >>>> In short: the code is the documentation ;)
> > > >>>>
> > > >>>>
> > > >>>>> The UART I am using is "snps,dw-apb-uart".
> > > >>>>>
> > > > >>>>> Last week, to output the early logs, I implemented this hack:
> > > > >>>>> 1. Modify the printk macro to run my print_func.
> > > > >>>>> 2. This print_func writes the characters into a single global variable
> > > > >>>>> (u32 simul_uart;).
> > > > >>>>> 3. Get the address of this global variable and extract all writes to it
> > > > >>>>> from the Tarmac logs.
> > > > >>>>>
> > > > >>>>> This is a very slow and tedious process, but it helped me identify the
> > > > >>>>> initial SError.
> > > > >>>>> Initially I thought I could write directly into the UART FIFO register
> > > > >>>>> (whose address I know), but this didn't work because Linux had already
> > > > >>>>> set up the MMU, so I guess I need to know the virtual address of this
> > > > >>>>> FIFO.
> > > > >>>>> Do I need to use __phys_to_virt of some sort?
> > > >>>>
> > > > >>>> Yes, I think so. Have a look at the existing serial driver, too. It should
> > > > >>>> do what's needed, and you can borrow that, then.
> > > >>>
> > > > >>> If you have access to the RAM after the crash (through a debugger or in
> > > > >>> your bootloader) and your memory is stable, find out the address of
> > > > >>> __log_buf in System.map. That's the buffer printk writes into, so
> > > > >>> dumping its content shows what you would see if the UART worked...
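
The System.map lookup described above can be sketched as a small helper. This is a minimal sketch, assuming the usual "<hex address> <type> <name>" line layout and a well-formed file; the function name is made up for illustration, and the addresses in any sample input would be hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find 'symbol' in System.map-style text ("<hex address> <type> <name>" per
 * line) and store its address in *addr. Returns 0 on success, -1 if absent.
 */
static int system_map_lookup(const char *map_text, const char *symbol,
                             uint64_t *addr)
{
    const char *line = map_text;

    assert(map_text != NULL && symbol != NULL && addr != NULL);

    while (*line != '\0') {
        uint64_t value;
        char type;
        char name[128];

        /* Parse one "<address> <type> <name>" entry at the current line. */
        if (sscanf(line, "%" SCNx64 " %c %127s", &value, &type, name) == 3 &&
            strcmp(name, symbol) == 0) {
            *addr = value;
            return 0;
        }

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line == NULL)
            break;
        line++;
    }
    return -1;
}
```

The address found this way is a kernel virtual address. Turning it into a physical address for a raw memory dump is platform-specific and is not shown here.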
> > > >>>
> > > >>> Hope it helps!
> > > >>>
> > > >>> bye,
> > > >>> Heiko
> > > >>>>
> > > >>>> Best regards
> > > >>>>
> > > >>>> Dirk
> > > >>>>
> > > >>>>
> > > >>>>> Cheers,
> > > >>>>> Lior.
> > > >>>>>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > > >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> > > >>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > > embedded@vger.kernel.org
> > > >>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>
> > > > >>>>>> On 21.12.23 at 08:43, Lior Weintraub wrote:
> > > >>>>>>> Hi Dirk,
> > > >>>>>>>
> > > > >>>>>>> We found that the issue was in the early stages of Barebox (a.k.a.
> > > > >>>>>>> U-Boot v2).
> > > >>>>>>
> > > >>>>>> Glad to hear that! :)
> > > >>>>>>
> > > > >>>>>>> Our implementation of putc_ll (in debug_ll) was writing into the UART
> > > > >>>>>>> Tx FIFO without checking whether the FIFO was full.
> > > > >>>>>>> Once the FIFO got full, it caused this SError, probably because the
> > > > >>>>>>> UART IP generated an apberror signal.
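
The kind of check that was missing can be sketched as follows. This is not the actual Barebox code: it is a minimal sketch assuming a 16550-style register layout with 32-bit wide registers, where bit 5 of the line status register reports that the transmitter can take another byte. The function name is made up, and the exact meaning of that bit can depend on the UART's FIFO configuration, so the databook should be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16550-style register indices, assuming 32-bit wide registers (mmio32). */
#define UART_THR       0          /* transmit holding register */
#define UART_LSR       5          /* line status register */
#define UART_LSR_THRE  (1u << 5)  /* transmit holding register empty */

/*
 * Write one character, but only after the UART reports room for it.
 * 'regs' points at the UART register block (or at a fake one in a test).
 */
static void putc_ll_checked(volatile uint32_t *regs, char c)
{
    assert(regs != NULL);

    /* Busy-wait until the transmitter can accept another byte. */
    while ((regs[UART_LSR] & UART_LSR_THRE) == 0)
        ;

    regs[UART_THR] = (uint8_t)c;
}
```

On real hardware 'regs' would be the mapped UART base address. For a host-side check, a plain array with the status bit preset is enough.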
> > > >>>>>>
> > > >>>>>> Thanks for the report!
> > > >>>>>>
> > > > >>>>>>> Now Linux is running and doesn't report the SError again, but now we
> > > > >>>>>>> face another issue.
> > > > >>>>>>> We see that the PC is getting into a "report_bug" function.
> > > > >>>>>>> Linux doesn't print anything to the UART (probably because it hasn't
> > > > >>>>>>> got to the point where the console is configured?).
> > > >>>>>>
> > > > >>>>>> For cases like this, using earlyprintk is usually a good option. Check
> > > > >>>>>> whether the Linux kernel serial console (UART) driver of your SoC
> > > > >>>>>> supports it. In the end it should be "just" a function in the serial
> > > > >>>>>> console driver which outputs the console data via polling before
> > > > >>>>>> (later) the interrupt driven console part takes over.
> > > >>>>>>
> > > >>>>>> Best regards
> > > >>>>>>
> > > >>>>>> Dirk
> > > >>>>>>
> > > >>>>>>
> > > > >>>>>>> Since our debug means are limited, it can take some time to find the
> > > > >>>>>>> root cause.
> > > >>>>>>>
> > > >>>>>>> I will keep you posted and update our findings.
> > > >>>>>>> Love to hear your thoughts,
> > > >>>>>>>
> > > >>>>>>> Cheers,
> > > >>>>>>> Lior.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > > >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> > > >>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > > embedded@vger.kernel.org
> > > >>>>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>>>
> > > > >>>>>>>> On 19.12.23 at 14:23, Lior Weintraub wrote:
> > > >>>>>>>>> Thanks Dirk,
> > > >>>>>>>>
> > > >>>>>>>> Welcome :)
> > > >>>>>>>>
> > > > >>>>>>>> In case you find the root cause, it would be nice to get some generic
> > > > >>>>>>>> description of it so that we can learn something :)
> > > >>>>>>>>
> > > >>>>>>>> Best regards
> > > >>>>>>>>
> > > >>>>>>>> Dirk
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>>> -----Original Message-----
> > > >>>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > > >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> > > >>>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > > >>>>>> embedded@vger.kernel.org
> > > >>>>>>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>>>>>
> > > >>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> > > >>>>>>>>>>> Hi,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
> > > >>>>>>>>>>> This SoC is ARM64 (A53) single core based device.
> > > >>>>>>>>>>> It runs correctly on QEMU but fails with SError on emulation
> > > platform
> > > >>>>>>>>>> (Synopsys Zebu running our SoC model).
> > > >>>>>>>>>>> There is no debugger connected to this emulation but there
> are
> > > several
> > > >>>>>>>>>> debug capabilities we can use:
> > > >>>>>>>>>>> 1. Generating wave dump of CPU signals
> > > >>>>>>>>>>> 2. Generate a Tarmac log
> > > >>>>>>>>>>> 3. UART
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Since the SError happens at early stages of Linux boot the
> UART
> > > is not
> > > >>>>>>>>>> enabled yet.
> > > >>>>>>>>>>>      From the Tarmac log we can see:
> > > >>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret (parse_early_param)
> > > >>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
> > > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > > >>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif,   x0      (setup_arch)
> > > >>>>>>>>>>>                          R CPSR 600000c5
> > > >>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
> > > >>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
> > > >>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
> > > >>>>>>>>>>>                          R CPSR 600003c5
> > > >>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
> > > >>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
> > > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp,     #0x150  (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp,     x0      (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
> > > >>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
> > > >>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> > > >>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
> > > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > > >>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp,     x0      (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354        <el1h_64_error>         (vectors)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> If I understand correctly, the exception happened sometime earlier
> > > >>>>>>>>>>> and only now has the Linux boot code (setup_arch) unmasked it, so we
> > > >>>>>>>>>>> immediately jump to the SError exception handler.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> > > >>>>>>>>>> running something "quite new" on some software (QEMU) and
> > > >>>>>>>>>> hardware (Synopsys) simulators.
> > > >>>>>>>>>>
> > > >>>>>>>>>> That would mean that you have new hardware with e.g. a new memory
> > > >>>>>>>>>> map not used before. What you describe might sound like there is
> > > >>>>>>>>>> something in the code before Linux (boot loader) resulting in the
> > > >>>>>>>>>> SError. This might be an access to non-existing or non-enabled
> > > >>>>>>>>>> hardware, i.e. it might be that you try to access (read/write) an
> > > >>>>>>>>>> address which is not available yet (or just invalid). It's hard to
> > > >>>>>>>>>> debug that. In case you are able to modify the code before Linux
> > > >>>>>>>>>> (the boot loader?) you might try to enable SError exceptions there,
> > > >>>>>>>>>> too, to get it earlier and with that make the search window
> > > >>>>>>>>>> smaller. I'm not that familiar with QEMU, but could you try to trace
> > > >>>>>>>>>> which (all?) hardware accesses your code does, and with that check
> > > >>>>>>>>>> if all these accesses are valid even on the hardware (Synopsys)
> > > >>>>>>>>>> emulation system? That should be checked from the valid address and
> > > >>>>>>>>>> from the hardware subsystem enablement point of view.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hth,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Dirk
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>>      From the Linux source:
> > > >>>>>>>>>>>           parse_early_param();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>           dynamic_scs_init();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>           /*
> > > >>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
> > > >>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
> > > >>>>>>>>>>>            * occurred).
> > > >>>>>>>>>>>            */
> > > >>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract
> > the
> > > logs:
> > > >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000
> [0x410fd034]
> > > >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-
> > > linux-gnu-
> > > >>>>>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0,
> > GNU
> > > ld
> > > >>>>>> (GNU
> > > >>>>>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> > > >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> > > >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 --
> > SError
> > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS
> > > BTYPE=--)
> > > >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
> > > >>>>>>>>>>> lr : setup_arch+0x134/0x5ac
> > > >>>>>>>>>>> sp : ffff8000808f3da0
> > > >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
> > > >>>>>>>>>> 0000000005e31b58c
> > > >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
> > > >>>>>>>>>> ffff8000808f8000c
> > > >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
> > > >>>>>>>> ffff800080010000c
> > > >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
> > > >>>>>> 000000002266684ac
> > > >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
> > > >>>>>>>>>> 0000000000000008c
> > > >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
> > > >>>>>> 0000000000000003c
> > > >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
> > > >>>>>>>> 0000000000000038c
> > > >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
> > > >>>>>>>> 0000000000000001c
> > > >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
> > > >>>>>>>>>> 0000000000000065c
> > > >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
> > > >>>>>>>>>> 00000000000000c0c
> > > >>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >>>>>>>>>>> Call trace:
> > > >>>>>>>>>>>       dump_backtrace+0x9c/0xd0
> > > >>>>>>>>>>>       show_stack+0x14/0x1c
> > > >>>>>>>>>>>       dump_stack_lvl+0x44/0x58
> > > >>>>>>>>>>>       dump_stack+0x14/0x1c
> > > >>>>>>>>>>>       panic+0x2e0/0x33c
> > > >>>>>>>>>>>       nmi_panic+0x68/0x6c
> > > >>>>>>>>>>>       arm64_serror_panic+0x68/0x78
> > > >>>>>>>>>>>       do_serror+0x24/0x54
> > > >>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
> > > >>>>>>>>>>>       el1h_64_error+0x64/0x68
> > > >>>>>>>>>>>       setup_arch+0x13c/0x5ac
> > > >>>>>>>>>>>       start_kernel+0x5c/0x5b8
> > > >>>>>>>>>>>       __primary_switched+0xb4/0xbc
> > > >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError
> > > Interrupt ]---
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Can you please advise how to proceed with debugging?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks in advance,
> > > >>>>>>>>>>> Cheers,
> > > >>>>>>>>>>> Lior.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>> --
> > > >>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > > >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell,
> > Germany
> > > >>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email:
> > > hs@denx.de
> > > >
> > >
> > > --
> > > DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> > > Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email:
> hs@denx.de


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-24 19:12 UTC (permalink / raw)
  To: hs@denx.de, Dirk Behme; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB0555FA423161E68EC71AC484C39AA@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Update:
The UART issue ("unable to open an initial console") was resolved.
I was missing CONFIG_SERIAL_8250_DW=y in my config.
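
For anyone hitting the same thing, here is a minimal sketch of a check for the three 8250 options that turned out to matter in this thread (CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE, CONFIG_SERIAL_8250_DW). The helper name and the sample text are made up for illustration; only the option names come from the thread.

```python
import re

# Option names taken from this thread; everything else here is illustrative.
REQUIRED = (
    "CONFIG_SERIAL_8250",
    "CONFIG_SERIAL_8250_CONSOLE",
    "CONFIG_SERIAL_8250_DW",
)


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in a .config text."""
    # Lines such as "# CONFIG_FOO is not set" start with '#', so they never match.
    enabled = set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))
    return [opt for opt in required if opt not in enabled]


# Hypothetical .config excerpt resembling the state before the fix.
example = """\
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
# CONFIG_SERIAL_8250_DW is not set
"""

print(missing_options(example))
```

To run it against a real build, read the kernel build directory's .config into a string and pass that in instead of the sample text.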

The only issue left now is that the CPU sits idle in "wfi" and no interrupts are coming.
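
For readers following the timer question, here is a rough sketch of the arithmetic; it is not a diagnosis. The quoted log below reports "arch_timer: cp15 timer(s) running at 62.50MHz" and "lpj=250000", while the quoted message says CNTFRQ_EL0 was set to 2M. The snippet derives HZ from the two logged numbers, assuming lpj was computed as timer rate / HZ (which "value calculated using timer frequency" suggests), and then shows how long one tick would last if the counter really advanced at 2MHz. That 2MHz counter rate is an assumption taken from the message, not something the log confirms, and the function names are invented for illustration.

```python
REPORTED_RATE_HZ = 62_500_000   # from "arch_timer: ... running at 62.50MHz"
LPJ = 250_000                   # from "(lpj=250000)"
ASSUMED_COUNTER_HZ = 2_000_000  # assumption: counter really ticks at the stated 2M


def kernel_hz(timer_rate_hz, lpj):
    """HZ implied by lpj, assuming lpj = timer_rate / HZ."""
    return timer_rate_hz // lpj


def tick_period_s(timer_rate_hz, hz, real_counter_hz):
    """Seconds per scheduler tick if the counter runs at real_counter_hz."""
    cycles_per_tick = timer_rate_hz // hz  # what the kernel programs per tick
    return cycles_per_tick / real_counter_hz


hz = kernel_hz(REPORTED_RATE_HZ, LPJ)
period = tick_period_s(REPORTED_RATE_HZ, hz, ASSUMED_COUNTER_HZ)
slowdown = REPORTED_RATE_HZ / ASSUMED_COUNTER_HZ

print(f"HZ={hz}, tick every {period:.3f} s of emulated time, {slowdown:.2f}x slower than the kernel assumes")
```

If the assumption holds, timer interrupts would still arrive, only far less often than the kernel expects, so it may be worth checking where the 62.5MHz figure comes from before concluding that no interrupts are delivered at all.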

> -----Original Message-----
> From: Lior Weintraub
> Sent: Sunday, December 24, 2023 5:42 PM
> To: hs@denx.de; Dirk Behme <dirk.behme@gmail.com>
> Cc: linux-embedded@vger.kernel.org
> Subject: RE: Debugging early SError exception
> 
> Hi,
> 
> The GICv3 issue was resolved after:
> 1. Setting bit 0 and bit 3 of ICC_SRE_EL3 (we don't have virtualization support
> and hence ICC_SRE_EL2 is not supported).
> 2. Powering up the GICR at EL3.
> 
> The earlycon issue was resolved after:
> 1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot
> args.
> 2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously I had only
> CONFIG_SERIAL_8250=y).
> 
> Now I face a new issue:
> Linux boot hangs on "wait for interrupt" at cpu_do_idle.
> 
> The program counter is stuck at 0xffff8000805ae45c.
> ffff8000805ae454 <cpu_do_idle>:
> ffff8000805ae454:       d5033f9f        dsb     sy
> ffff8000805ae458:       d503207f        wfi
> ffff8000805ae45c:       d65f03c0        ret
> 
> I think that something is wrong with the timer or GIC settings and as a result
> the scheduler doesn't get the interrupts (timer ticks).
> 
> Additional info that might be relevant to this issue:
> The emulation platform runs at about 2.8MHz.
> CNTFRQ_EL0 is set to 2M (because the emulation platform's running frequency
> varies between 1.9 and 2.8MHz).
> The reason for those settings is to allow Linux to run as it would in the
> "real" world.
> 
> It is my understanding that there are 2 issues here:
> 1. Something is wrong with the timer/interrupt settings (note that the same
> configuration runs correctly on QEMU).
> 2. Something is wrong with the initramfs: according to the kernel source it
> seems to fail to open "/dev/console".
> 
> The full Linux boot log:
> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU
> Binuti) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023
> Machine model: Pliops Spider MK-I EVK
> earlycon: uart8250 at MMIO32 0x000000d000307000 (options '115200n8')
> printk: bootconsole [uart8250] enabled
> efi: UEFI not found.
> Zone ranges:
>   DMA      [mem 0x0000000000000000-0x000000002fffffff]
>   DMA32    empty
>   Normal   empty
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x0000000000000000-0x000000002fffffff]
> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> Detected VIPT I-cache on CPU0
> CPU features: detected: GIC system register CPU interface
> CPU features: detected: ARM erratum 845719
> alternatives: applying boot alternatives
> Kernel command line: console=ttyS0,115200n8
> earlycon=uart8250,mmio32,0xd000307000,115200n8
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> Built 1 zonelists, mobility grouping on.  Total pages: 193536
> mem auto-init: stack:off, heap alloc:off, heap free:off
> software IO TLB: area num 1.
> software IO TLB: mapped [mem 0x000000002b080000-
> 0x000000002f080000] (64MB)
> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata,
> 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> trace event string verifier disabled
> rcu: Hierarchical RCU implementation.
> rcu:    RCU event tracing is enabled.
> rcu:    RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> GICv3: 96 SPIs implemented
> GICv3: 0 Extended SPIs implemented
> Root IRQ handler: gic_handle_irq
> GICv3: GICv3 features: 16 PPIs
> GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> ITS [mem 0xe000040000-0xe00005ffff]
> ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz 8,
> psz 64K, shr 1)
> ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000
> (flat, esz 2, psz 64K, shr 1)
> GICv3: Expected reserved range
> [0x00000000000c0000:0x00000000000cffff], not found
> GICv3: using LPI property table @0x00000000000c0000
> GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted
> CPU0: Failed to disable LPIs
> rcu: srcu_init: Setting srcu_struct sizes based on contention.
> arch_timer: cp15 timer(s) running at 62.50MHz (virt).
> clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles:
> 0x1cd42e208c, max_idle_ns: 881590405314 ns
> sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every
> 4398046511096ns
> Console: colour dummy device 80x25
> Calibrating delay loop (skipped), value calculated using timer frequency..
> 125.00 BogoMIPS (lpj=250000)
> pid_max: default: 32768 minimum: 301
> Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> cacheinfo: Unable to detect cache hierarchy for CPU 0
> rcu: Hierarchical SRCU implementation.
> rcu:    Max phase no-delay instances is 1000.
> Platform MSI: gic-its@E000040000 domain created
> PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000
> domain created
> EFI services will not be available.
> smp: Bringing up secondary CPUs ...
> smp: Brought up 1 node, 1 CPU
> SMP: Total of 1 processors activated.
> CPU features: detected: 32-bit EL0 Support
> CPU features: detected: CRC32 instructions
> CPU: All CPU(s) started at EL1
> alternatives: applying system-wide alternatives
> devtmpfs: initialized
> clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns:
> 7645041785100000 ns
> futex hash table entries: 256 (order: 2, 16384 bytes, linear)
> DMI not present or invalid.
> DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic
> allocations
> DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic
> allocations
> hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
> ASID allocator initialised with 65536 entries
> Serial: AMBA PL011 UART driver
> Modules: 30080 pages in range for non-PLT usage
> Modules: 521600 pages in range for PLT usage
> iommu: Default domain type: Translated
> iommu: DMA domain TLB invalidation policy: strict mode
> SCSI subsystem initialized
> vgaarb: loaded
> clocksource: Switched to clocksource arch_sys_counter
> PCI: CLS 0 bytes, default 64
> workingset: timestamp_bits=46 max_order=18 bucket_order=0
> fuse: init (API version 7.38)
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> Unpacking initramfs...
> Freeing initrd memory: 4596K
> Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
> hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters
> available
> clk: Disabling unused clocks
> Warning: unable to open an initial console.
> Freeing unused kernel memory: 1600K
> 
> Thanks in advance for your great advice and support,
> Cheers,
> Lior.
> 
> > -----Original Message-----
> > From: Heiko Schocher <hs@denx.de>
> > Sent: Friday, December 22, 2023 10:04 AM
> > To: Dirk Behme <dirk.behme@gmail.com>; Lior Weintraub
> > <liorw@pliops.com>
> > Cc: linux-embedded@vger.kernel.org
> > Subject: Re: Debugging early SError exception
> >
> > Hello Dirk, Lior,
> >
> > On 22.12.23 08:48, Dirk Behme wrote:
> > > Am 22.12.23 um 08:03 schrieb Lior Weintraub:
> > >> Hi,
> > >>
> > >> I managed to dump the __log_buf but for some reason the UART is still
> not
> > working.
> > >> Please note that UART printed all the U-BOOT traces so AFAIU, the device
> > tree is set correctly.
> > >> (Barebox is passing its DTB to the kernel).
> > >>
> > >> To enable the earlyprintk I have:
> > >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and
> > CONFIG_DEBUG_LL=y
> > >> 2. Modified the boot args to include: "console=ttyS0,115200n8
> > earlycon=dw-apb-uart,0xd000307000"
> > >> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
> > >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart",
> > early_serial8250_setup);
> > >>
> > >>  From __log_buf dump:
> > >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4]
> > >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> > gcc.br_real (Buildroot
> > >> 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107
> > SMP Thu Dec 21 17:33:12 IST 202323
> > >> Machine model: Pliops Spider MK-I EVKVK
> > >> efi: UEFI not found.d.
> > >> Zone ranges:s:
> > >>    DMA      [mem 0x0000000000000000-0x000000002fffffff]f]
> > >>    DMA32    emptyty
> > >>    Normal   emptyty
> > >> Movable zone start for each nodede
> > >> Early memory node rangeses
> > >>    node   0: [mem 0x0000000000000000-0x000000002fffffff]f]
> > >> Initmem setup node 0 [mem 0x0000000000000000-
> > 0x000000002fffffff]f]
> > >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000
> > >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> > >> pcpu-alloc: [0] 0
> > >> Detected VIPT I-cache on CPU0U0
> > >> CPU features: GIC system register CPU interface present but disabled by
> > higher exception levelel
> > >> CPU features: detected: ARM erratum 84571919
> > >> alternatives: applying boot alternativeses
> > >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-
> > uart,0xd00030700000
> > >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes,
> linear)r)
> > >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r)
> > >> Built 1 zonelists, mobility grouping on.  Total pages: 19353636
> > >> mem auto-init: stack:off, heap alloc:off, heap free:offff
> > >> software IO TLB: area num 1.1.
> > >> software IO TLB: mapped [mem 0x000000002b080000-
> > 0x000000002f080000] (64MB)B)
> > >> Memory: 689240K/786432K available (5824K kernel code, 1186K
> rwdata,
> > 1612K rodata, 1600K init, 400K
> > >> bss, 97192K reserved, 0K cma-reserved)d)
> > >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1
> > >> trace event string verifier disableded
> > >> rcu: Hierarchical RCU implementation.n.
> > >> rcu:     RCU event tracing is enabled.d.
> > >> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1.
> > >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s.
> > >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1
> > >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0
> > >> GICv3: 96 SPIs implementeded
> > >> GICv3: 0 Extended SPIs implementeded
> > >> Root IRQ handler: gic_handle_irqrq
> > >> GICv3: GICv3 features: 16 PPIsIs
> > >> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000
> > >> GICv3: redistributor failed to wakeup.....
> > >> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad
> > >
> > > I think the two messages above are the essential ones.
> >
> > +1
> >
> > > Maybe it helps to check
> > >
> > > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> > >
> > > In the middle of that page in the "Call the kernel image" it has something
> > about GIC:
> > >
> > > -- cut --
> > > If the kernel is entered at EL1:
> > >
> > >         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
> > >         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > > -- cut --
> >
> > Also, maybe it makes sense to check your firmware (bootloader, ATF?) ... maybe
> > there is some setting missing for your SoC/board?
> >
> > bye,
> > Heiko
> >
> > >
> > >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1]
> > SMPMP
> > >> Modules linked in:
> > >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> > >> Hardware name: Pliops Spider MK-I EVK (DT)
> > >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > >> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> > >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> > >> sp : ffff8000808f3b40
> > >> x29: ffff8000808f3b40 x28: 0000000000000000 x27:
> > 0000000000000001
> > >> x26: ffff000000016040 x25: 0000000000000000 x24:
> ffff800080a6b000
> > >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> > >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> > >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> > >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> > >> x11: 452074612064656c x10: 6261736964282045 x9 :
> > 6428204552532074
> > >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> > >> x5 : 000000000000002a x4 : 0000000000000000 x3 :
> > 0000000000000000
> > >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
> > >> Call trace:
> > >>   gic_cpu_sys_reg_init+0x58/0x2e4
> > >>   gic_cpu_init.part.0+0xa8/0x114
> > >>   gic_init_bases+0x408/0x684
> > >>   gic_of_init+0x298/0x300
> > >>   of_irq_init+0x1c8/0x368
> > >>   irqchip_init+0x14/0x1c
> > >>   init_IRQ+0x98/0xac
> > >>   start_kernel+0x250/0x5b8
> > >>   __primary_switched+0xb4/0xbc
> > >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) )
> > >> ---[ end trace 0000000000000000 ]-----
> > >> Kernel panic - not syncing: Attempted to kill the idle task!k!
> > >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]-----
> > >>
> > >>
> > >> The kernel panic is related to GIC distributor (currently under debug) but
> > AFAIU,
> > >> this has nothing to do with the UART not working on early stages.
> > >
> > >
> > > > Yes, I agree. The GIC issue and the UART (at least the polling mode)
> > > > should be independent.
> > >
> > > Best regards
> > >
> > > Dirk
> > >
> > >
> > >> Thanks in advance for your advice,
> > >> Cheers,
> > >> Lior.
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Heiko Schocher <hs@denx.de>
> > >>> Sent: Thursday, December 21, 2023 1:37 PM
> > >>> To: Lior Weintraub <liorw@pliops.com>
> > >>> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-
> > embedded@vger.kernel.org
> > >>> Subject: Re: Debugging early SError exception
> > >>>
> > >>> Hi Lior,
> > >>>
> > >>> On 21.12.23 12:19, Dirk Behme wrote:
> > >>>> Am 21.12.23 um 11:04 schrieb Lior Weintraub:
> > >>>>> Thanks Dirk,
> > >>>>>
> > >>>>> Regarding the earlyprintk, not sure I know how to make it work.
> > >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> on
> > my
> > >>> config but it doesn't seem to work.
> > >>>>> Do I need to pass something in the bootargs from the U-BOOT?
> > >>>>> Do I need to add that into my device tree?
> > >>>>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under
> > "chosen"
> > >>> on my DT but it didn't
> > >>>>> work)
> > >>>>
> > >>>> Yes, what has to be enabled and what not, and what has to be set how, is
> > >>>> often confusing. I think this is not common for all systems, so to be on
> > >>>> the safe side you have to look into the code for your system. Or in
> > >>>> short: the code is the documentation ;)
> > >>>>
> > >>>>
> > >>>>> The UART I am using is "snps,dw-apb-uart".
> > >>>>>
> > >>>>> Last week, to output the early logs I have implemented this hack:
> > >>>>> 1. Modify printk macro to run my print_func
> > >>>>> 2. This print_func wrote the characters into a single global variable
> (u32
> > >>> simul_uart;)
> > >>>>> 3. Get the address location of this global variable and extract all writes
> to
> > it
> > >>> from the Tarmac
> > >>>>> logs.
> > >>>>>
> > >>>>> This is a very slow and tedious process but it helped me identify the
> > initial
> > >>> SError.
> > >>>>> Initially I thought I can write directly into the UART FIFO register
> (which I
> > know
> > >>> the address)
> > >>>>> but this didn't work because Linux already setup the MMU so I guess I
> > need to
> > >>> know the virtual
> > >>>>> address of this FIFO.
> > >>>>> Do I need to use __phys_to_virt of some sort?
> > >>>>
> > >>>> Yes, I think so. Have a look at the existing serial driver, too. It should
> > >>>> do what's needed, and you can borrow that, then.
> > >>>
> > >>> If you have access to the RAM after the crash (through a debugger or in
> > >>> your bootloader) and your mem is stable, find out the address of
> > >>> __log_buf in System.map. That's the buffer printk writes into, so
> > >>> dumping its content gives you what you would see if the UART worked...
> > >>>
> > >>> Hope it helps!
> > >>>
> > >>> bye,
> > >>> Heiko
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>> Dirk
> > >>>>
> > >>>>
> > >>>>> Cheers,
> > >>>>> Lior.
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> > >>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > embedded@vger.kernel.org
> > >>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>
> > >>>>>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> > >>>>>>> Hi Dirk,
> > >>>>>>>
> > >>>>>>> We found that the issue was at the early stages of Barebox (a.k.a U-
> > BOOT
> > >>>>>> v2).
> > >>>>>>
> > >>>>>> Glad to hear that! :)
> > >>>>>>
> > >>>>>>> Our implementation of putc_ll (on debug_ll) was writing into the
> > UART Tx
> > >>>>>> FIFO without checking if the FIFO is full.
> > >>>>>>> Once the fifo got full it caused this SError probably because the
> UART
> > IP
> > >>>>>> generated an apberror signal.
> > >>>>>>
> > >>>>>> Thanks for the report!
> > >>>>>>
> > >>>>>>> Now the Linux is running and doesn't report the SError again but
> now
> > we
> > >>>>>> face another issue.
> > >>>>>>> We see that the PC is getting into a "report_bug" function.
> > >>>>>>> The Linux doesn't print anything to the UART (probably since it
> hasn't
> > got to
> > >>>>>> the point where the console is configured?).
> > >>>>>>
> > >>>>>> For cases like this using earlyprintk is usually a good option. Check
> > >>>>>> the Linux kernel serial console (UART) driver of your SoC if it
> > >>>>>> supports it. In the end it should be "just" a function in the serial
> > >>>>>> console driver which outputs the console data via polling before
> > >>>>>> (later) the interrupt driven console part takes over.
> > >>>>>>
> > >>>>>> Best regards
> > >>>>>>
> > >>>>>> Dirk
> > >>>>>>
> > >>>>>>
> > >>>>>>> Since our debug means are limited it can take some time to find the
> > root
> > >>>>>> cause.
> > >>>>>>>
> > >>>>>>> I will keep you posted and update our findings.
> > >>>>>>> Love to hear your thoughts,
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Lior.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> > >>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > embedded@vger.kernel.org
> > >>>>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>>>
> > >>>>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
> > >>>>>>>>> Thanks Dirk,
> > >>>>>>>>
> > >>>>>>>> Welcome :)
> > >>>>>>>>
> > >>>>>>>> In case you find the root cause it would be nice to get some generic
> > >>>>>>>> description of it so that we can learn something :)
> > >>>>>>>>
> > >>>>>>>> Best regards
> > >>>>>>>>
> > >>>>>>>> Dirk
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> > >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> > >>>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> > >>>>>> embedded@vger.kernel.org
> > >>>>>>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>>>>>
> > >>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
> > >>>>>>>>>>> This SoC is ARM64 (A53) single core based device.
> > >>>>>>>>>>> It runs correctly on QEMU but fails with SError on emulation
> > platform
> > >>>>>>>>>> (Synopsys Zebu running our SoC model).
> > >>>>>>>>>>> There is no debugger connected to this emulation but there are
> > several
> > >>>>>>>>>> debug capabilities we can use:
> > >>>>>>>>>>> 1. Generating wave dump of CPU signals
> > >>>>>>>>>>> 2. Generate a Tarmac log
> > >>>>>>>>>>> 3. UART
> > >>>>>>>>>>>
> > >>>>>>>>>>> Since the SError happens at early stages of Linux boot the UART
> > is not
> > >>>>>>>>>> enabled yet.
> > >>>>>>>>>>>      From the Tarmac log we can see:
> > >>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O
> > el1h_ns:   ret
> > >>>>>>>>>> (parse_early_param)
> > >>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O
> > el1h_ns:   mov
> > >>>>>>>> x0,
> > >>>>>>>>>> #0xc0   //      #192    (setup_arch)
> > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > >>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O
> > el1h_ns:   msr
> > >>>>>>>>>> daif,   x0      (setup_arch)
> > >>>>>>>>>>>                          R CPSR 600000c5
> > >>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
> > >>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
> > >>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
> > >>>>>>>>>>>                          R CPSR 600003c5
> > >>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
> > >>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
> > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O
> > el1h_ns:   sub
> > >>>>>>>> sp,
> > >>>>>>>>>> sp,     #0x150  (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O
> > el1h_ns:   add
> > >>>>>>>> sp,
> > >>>>>>>>>> sp,     x0      (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
> > >>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O
> > el1h_ns:   sub
> > >>>>>>>> x0,
> > >>>>>>>>>> sp,     x0      (vectors)
> > >>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O
> > el1h_ns:   tbnz
> > >>>>>>>> w0,
> > >>>>>>>>>> #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> > >>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O
> > el1h_ns:   sub
> > >>>>>>>> x0,
> > >>>>>>>>>> sp,     x0      (vectors)
> > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > >>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O
> > el1h_ns:   sub
> > >>>>>> sp,
> > >>>>>>>>>> sp,     x0      (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O
> > el1h_ns:   b
> > >>>>>>>>>> ffff800080011354        <el1h_64_error>         (vectors)
> > >>>>>>>>>>>
> > >>>>>>>>>>> If I understand correctly, the exception happened sometime earlier and
> > >>>>>>>>>>> only now Linux boot code (setup_arch) opened the exception handling and as
> > >>>>>>>>>>> a result we immediately jump to the SError exception handler.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> > >>>>>>>>>> running something "quite new" on software (QEMU) and hardware
> > >>>>>>>>>> (Synopsys) simulators.
> > >>>>>>>>>>
> > >>>>>>>>>> That would mean that you have new hardware with, e.g., a new memory
> > >>>>>>>>>> map not used before. What you describe sounds like something in the
> > >>>>>>>>>> code before Linux (the boot loader) is causing the SError. This
> > >>>>>>>>>> might be an access to non-existing or non-enabled hardware, i.e.
> > >>>>>>>>>> you might try to access (read/write) an address which is not
> > >>>>>>>>>> available yet (or just invalid). That is hard to debug. In case you
> > >>>>>>>>>> are able to modify the code before Linux (the boot loader?), you
> > >>>>>>>>>> might try to enable SError exceptions there, too, to get the
> > >>>>>>>>>> exception earlier and so make the search window smaller. I'm not
> > >>>>>>>>>> that familiar with QEMU, but could you try to trace which (all?)
> > >>>>>>>>>> hardware accesses your code does, and then check whether all these
> > >>>>>>>>>> accesses are also valid on the hardware (Synopsys) emulation
> > >>>>>>>>>> system? That should be checked from both the valid-address and the
> > >>>>>>>>>> hardware-subsystem-enablement point of view.
> > >>>>>>>>>>
> > >>>>>>>>>> Hth,
> > >>>>>>>>>>
> > >>>>>>>>>> Dirk
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>>      From the Linux source:
> > >>>>>>>>>>>           parse_early_param();
> > >>>>>>>>>>>
> > >>>>>>>>>>>           dynamic_scs_init();
> > >>>>>>>>>>>
> > >>>>>>>>>>>           /*
> > >>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up
> > possible
> > >>>>>>>>>>>            * earlycon. (Report possible System Errors once we can
> > report
> > >>> this
> > >>>>>>>>>>>            * occurred).
> > >>>>>>>>>>>            */
> > >>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is
> > when we
> > >>>>>> get
> > >>>>>>>> the
> > >>>>>>>>>> exception.
> > >>>>>>>>>>>
> > >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract
> the
> > logs:
> > >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-
> > linux-gnu-
> > >>>>>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0,
> GNU
> > ld
> > >>>>>> (GNU
> > >>>>>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> > >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> > >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 --
> SError
> > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
> > >>>>>>>>>>> lr : setup_arch+0x134/0x5ac
> > >>>>>>>>>>> sp : ffff8000808f3da0
> > >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
> > >>>>>>>>>> 0000000005e31b58c
> > >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
> > >>>>>>>>>> ffff8000808f8000c
> > >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
> > >>>>>>>> ffff800080010000c
> > >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
> > >>>>>> 000000002266684ac
> > >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
> > >>>>>>>>>> 0000000000000008c
> > >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
> > >>>>>> 0000000000000003c
> > >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
> > >>>>>>>> 0000000000000038c
> > >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
> > >>>>>>>> 0000000000000001c
> > >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
> > >>>>>>>>>> 0000000000000065c
> > >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
> > >>>>>>>>>> 00000000000000c0c
> > >>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > >>>>>>>>>>> Call trace:
> > >>>>>>>>>>>       dump_backtrace+0x9c/0xd0
> > >>>>>>>>>>>       show_stack+0x14/0x1c
> > >>>>>>>>>>>       dump_stack_lvl+0x44/0x58
> > >>>>>>>>>>>       dump_stack+0x14/0x1c
> > >>>>>>>>>>>       panic+0x2e0/0x33c
> > >>>>>>>>>>>       nmi_panic+0x68/0x6c
> > >>>>>>>>>>>       arm64_serror_panic+0x68/0x78
> > >>>>>>>>>>>       do_serror+0x24/0x54
> > >>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
> > >>>>>>>>>>>       el1h_64_error+0x64/0x68
> > >>>>>>>>>>>       setup_arch+0x13c/0x5ac
> > >>>>>>>>>>>       start_kernel+0x5c/0x5b8
> > >>>>>>>>>>>       __primary_switched+0xb4/0xbc
> > >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError
> > Interrupt ]---
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can you please advise how to proceed with debugging?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks in advance,
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Lior.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell,
> Germany
> > >>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email:
> > hs@denx.de
> > >
> >
> > --
> > DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> > Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-24 15:41 UTC (permalink / raw)
  To: hs@denx.de, Dirk Behme; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <411697ed-088c-2cae-d204-b510f9f909fe@denx.de>

Hi,

The GICv3 issue was resolved after:
1. Setting bit 0 (SRE) and bit 3 (Enable) in ICC_SRE_EL3 (we don't have virtualization support, hence ICC_SRE_EL2 is not supported).
2. Powering up the GICR from EL3.
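
As a small illustration of step 1, the two bits combine into the mask 0x9. This is only a sketch: the constant and function names are hypothetical, and the real change is a system-register write done by the EL3 firmware, which is not shown here.

```python
# Bit positions as named in the arm64 booting notes quoted further down
# (ICC_SRE_ELx.SRE is bit 0, ICC_SRE_ELx.Enable is bit 3).
# The names below are hypothetical, chosen for this example only.
ICC_SRE_SRE = 1 << 0
ICC_SRE_ENABLE = 1 << 3


def set_sre_and_enable(old_value: int) -> int:
    """Return old_value with SRE and Enable set, leaving other bits alone."""
    return old_value | ICC_SRE_SRE | ICC_SRE_ENABLE


print(hex(set_sre_and_enable(0)))
```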

The earlycon issue was resolved after:
1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot args.
2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously only CONFIG_SERIAL_8250=y was set).
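
The earlycon argument above packs four comma-separated fields (name, I/O type, address, options). As an illustration only, the sketch below splits it and formats the address the way the "earlycon: uart8250 at MMIO32 ..." line in the boot log below prints it. parse_earlycon is a hypothetical helper, not kernel code, and it handles only this four-field form.

```python
def parse_earlycon(arg: str) -> dict:
    """Split 'earlycon=<name>,<iotype>,<addr>,<options>' into its fields.

    Hypothetical helper for illustration; only the four-field form is handled.
    """
    key, sep, value = arg.partition("=")
    if key != "earlycon" or not sep:
        raise ValueError(f"not an earlycon argument: {arg!r}")
    name, iotype, addr, options = value.split(",", 3)
    return {
        "name": name,
        "iotype": iotype,
        "addr": int(addr, 16),  # base 16 accepts the leading "0x"
        "options": options,
    }


parsed = parse_earlycon("earlycon=uart8250,mmio32,0xd000307000,115200n8")
print(parsed["name"], parsed["iotype"],
      format(parsed["addr"], "#018x"), parsed["options"])
```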

Now I face a new issue:
Linux boot hangs on "wait for interrupt" at cpu_do_idle.

The program counter is stuck at 0xffff8000805ae45c.
ffff8000805ae454 <cpu_do_idle>:
ffff8000805ae454:       d5033f9f        dsb     sy
ffff8000805ae458:       d503207f        wfi
ffff8000805ae45c:       d65f03c0        ret

I think something is wrong with the timer or GIC settings, and as a result the scheduler doesn't get its interrupts (timer ticks).

Additional info that might be relevant to this issue:
The emulation platform runs at about 2.8 MHz.
CNTFRQ_EL0 is set to 2 MHz (because the emulation platform's running frequency varies between 1.9 and 2.8 MHz).
The reason for these settings is to let Linux run as it would in the "real" world.
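
The boot log below reports the arch timer at 62.50MHz, while CNTFRQ_EL0 is described here as 2M (taken as 2,000,000 Hz in this sketch). A rough arithmetic check of what that gap would mean follows. It rests on two assumptions that are not confirmed in this thread: that the kernel derived lpj from the timer rate (as the "value calculated using timer frequency" line suggests), and that the counter really advances at about 2 MHz.

```python
# Values copied from the boot log in this mail.
ARCH_TIMER_HZ = 62_500_000   # "arch_timer: cp15 timer(s) running at 62.50MHz (virt)"
LPJ = 250_000                # "(lpj=250000)"

# Value this mail gives for CNTFRQ_EL0 ("2M"), assumed to mean 2,000,000 Hz.
CNTFRQ_STATED_HZ = 2_000_000


def implied_tick_rate(timer_hz: int, lpj: int) -> int:
    """Jiffies per second (HZ), assuming lpj was derived as timer_hz / HZ."""
    return timer_hz // lpj


def bogomips(lpj: int, hz: int) -> float:
    """BogoMIPS as printed by the kernel: lpj / (500000 / HZ)."""
    return lpj / (500_000 / hz)


def seconds_per_jiffy(lpj: int, counter_hz: int) -> float:
    """Time one jiffy would take if the counter advances at counter_hz."""
    return lpj / counter_hz


hz = implied_tick_rate(ARCH_TIMER_HZ, LPJ)
print("HZ implied by the log:", hz)
print("BogoMIPS:", bogomips(LPJ, hz))
print("kernel timer rate / stated CNTFRQ_EL0:", ARCH_TIMER_HZ / CNTFRQ_STATED_HZ)
print("seconds per jiffy at 2 MHz:", seconds_per_jiffy(LPJ, CNTFRQ_STATED_HZ))
```

The BogoMIPS figure agrees with the log, which suggests the kernel is working from 62.5 MHz rather than 2 MHz. Under the assumptions above, each timer tick would then take far longer than the kernel expects, so where the 62.5 MHz value comes from may be worth checking.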

It is my understanding that there are 2 issues here:
1. Something is wrong with the timer/interrupt settings (note that the same configuration runs correctly on QEMU).
2. Something is wrong with the initramfs: according to the kernel source, it seems to fail to open "/dev/console".

The full Linux boot log:
Booting Linux on physical CPU 0x0000000000 [0x410fd034]
Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023
Machine model: Pliops Spider MK-I EVK
earlycon: uart8250 at MMIO32 0x000000d000307000 (options '115200n8')
printk: bootconsole [uart8250] enabled
efi: UEFI not found.
Zone ranges:
  DMA      [mem 0x0000000000000000-0x000000002fffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000000002fffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
Detected VIPT I-cache on CPU0
CPU features: detected: GIC system register CPU interface
CPU features: detected: ARM erratum 845719
alternatives: applying boot alternatives
Kernel command line: console=ttyS0,115200n8 earlycon=uart8250,mmio32,0xd000307000,115200n8
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
Built 1 zonelists, mobility grouping on.  Total pages: 193536
mem auto-init: stack:off, heap alloc:off, heap free:off
software IO TLB: area num 1.
software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)
Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
trace event string verifier disabled
rcu: Hierarchical RCU implementation.
rcu:    RCU event tracing is enabled.
rcu:    RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
GICv3: 96 SPIs implemented
GICv3: 0 Extended SPIs implemented
Root IRQ handler: gic_handle_irq
GICv3: GICv3 features: 16 PPIs
GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
ITS [mem 0xe000040000-0xe00005ffff]
ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz 8, psz 64K, shr 1)
ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000 (flat, esz 2, psz 64K, shr 1)
GICv3: Expected reserved range [0x00000000000c0000:0x00000000000cffff], not found
GICv3: using LPI property table @0x00000000000c0000
GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted
CPU0: Failed to disable LPIs
rcu: srcu_init: Setting srcu_struct sizes based on contention.
arch_timer: cp15 timer(s) running at 62.50MHz (virt).
clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles: 0x1cd42e208c, max_idle_ns: 881590405314 ns
sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every 4398046511096ns
Console: colour dummy device 80x25
Calibrating delay loop (skipped), value calculated using timer frequency.. 125.00 BogoMIPS (lpj=250000)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
cacheinfo: Unable to detect cache hierarchy for CPU 0
rcu: Hierarchical SRCU implementation.
rcu:    Max phase no-delay instances is 1000.
Platform MSI: gic-its@E000040000 domain created
PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000 domain created
EFI services will not be available.
smp: Bringing up secondary CPUs ...
smp: Brought up 1 node, 1 CPU
SMP: Total of 1 processors activated.
CPU features: detected: 32-bit EL0 Support
CPU features: detected: CRC32 instructions
CPU: All CPU(s) started at EL1
alternatives: applying system-wide alternatives
devtmpfs: initialized
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
futex hash table entries: 256 (order: 2, 16384 bytes, linear)
DMI not present or invalid.
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
ASID allocator initialised with 65536 entries
Serial: AMBA PL011 UART driver
Modules: 30080 pages in range for non-PLT usage
Modules: 521600 pages in range for PLT usage
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: strict mode
SCSI subsystem initialized
vgaarb: loaded
clocksource: Switched to clocksource arch_sys_counter
PCI: CLS 0 bytes, default 64
workingset: timestamp_bits=46 max_order=18 bucket_order=0
fuse: init (API version 7.38)
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
io scheduler mq-deadline registered
io scheduler kyber registered
Unpacking initramfs...
Freeing initrd memory: 4596K
Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters available
clk: Disabling unused clocks
Warning: unable to open an initial console.
Freeing unused kernel memory: 1600K

Thanks in advance for your great advice and support,
Cheers,
Lior.

> -----Original Message-----
> From: Heiko Schocher <hs@denx.de>
> Sent: Friday, December 22, 2023 10:04 AM
> To: Dirk Behme <dirk.behme@gmail.com>; Lior Weintraub
> <liorw@pliops.com>
> Cc: linux-embedded@vger.kernel.org
> Subject: Re: Debugging early SError exception
> 
> Hello Dirk, Lior,
> 
> On 22.12.23 08:48, Dirk Behme wrote:
> > Am 22.12.23 um 08:03 schrieb Lior Weintraub:
> >> Hi,
> >>
> >> I managed to dump the __log_buf, but for some reason the UART is still not
> >> working.
> >> Please note that the UART printed all the U-BOOT traces, so AFAIU the device
> >> tree is set correctly.
> >> (Barebox is passing its DTB to the kernel.)
> >>
> >> To enable the earlyprintk I have:
> >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> >> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
> >> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
> >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
> >>
> >>  From __log_buf dump:
> >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4]
> >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> gcc.br_real (Buildroot
> >> 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107
> SMP Thu Dec 21 17:33:12 IST 202323
> >> Machine model: Pliops Spider MK-I EVKVK
> >> efi: UEFI not found.d.
> >> Zone ranges:s:
> >>    DMA      [mem 0x0000000000000000-0x000000002fffffff]f]
> >>    DMA32    emptyty
> >>    Normal   emptyty
> >> Movable zone start for each nodede
> >> Early memory node rangeses
> >>    node   0: [mem 0x0000000000000000-0x000000002fffffff]f]
> >> Initmem setup node 0 [mem 0x0000000000000000-
> 0x000000002fffffff]f]
> >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000
> >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> >> pcpu-alloc: [0] 0
> >> Detected VIPT I-cache on CPU0U0
> >> CPU features: GIC system register CPU interface present but disabled by
> higher exception levelel
> >> CPU features: detected: ARM erratum 84571919
> >> alternatives: applying boot alternativeses
> >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-
> uart,0xd00030700000
> >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)r)
> >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r)
> >> Built 1 zonelists, mobility grouping on.  Total pages: 19353636
> >> mem auto-init: stack:off, heap alloc:off, heap free:offff
> >> software IO TLB: area num 1.1.
> >> software IO TLB: mapped [mem 0x000000002b080000-
> 0x000000002f080000] (64MB)B)
> >> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata,
> 1612K rodata, 1600K init, 400K
> >> bss, 97192K reserved, 0K cma-reserved)d)
> >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1
> >> trace event string verifier disableded
> >> rcu: Hierarchical RCU implementation.n.
> >> rcu:     RCU event tracing is enabled.d.
> >> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1.
> >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s.
> >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1
> >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0
> >> GICv3: 96 SPIs implementeded
> >> GICv3: 0 Extended SPIs implementeded
> >> Root IRQ handler: gic_handle_irqrq
> >> GICv3: GICv3 features: 16 PPIsIs
> >> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000
> >> GICv3: redistributor failed to wakeup.....
> >> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad
> >
> > I think the two messages above are the essential ones.
> 
> +1
> 
> > Maybe it helps to check
> >
> > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> >
> > In the middle of that page, the "Call the kernel image" section has something
> > about the GIC:
> >
> > -- cut --
> > If the kernel is entered at EL1:
> >
> >         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
> >         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > -- cut --
> 
> It may also make sense to check your firmware (bootloader, ATF?); maybe
> some setting is missing for your SoC/board?
> 
> bye,
> Heiko
> 
> >
> >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1]
> SMPMP
> >> Modules linked in:
> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> >> Hardware name: Pliops Spider MK-I EVK (DT)
> >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> >> sp : ffff8000808f3b40
> >> x29: ffff8000808f3b40 x28: 0000000000000000 x27:
> 0000000000000001
> >> x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
> >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> >> x11: 452074612064656c x10: 6261736964282045 x9 :
> 6428204552532074
> >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> >> x5 : 000000000000002a x4 : 0000000000000000 x3 :
> 0000000000000000
> >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
> >> Call trace:
> >>   gic_cpu_sys_reg_init+0x58/0x2e4
> >>   gic_cpu_init.part.0+0xa8/0x114
> >>   gic_init_bases+0x408/0x684
> >>   gic_of_init+0x298/0x300
> >>   of_irq_init+0x1c8/0x368
> >>   irqchip_init+0x14/0x1c
> >>   init_IRQ+0x98/0xac
> >>   start_kernel+0x250/0x5b8
> >>   __primary_switched+0xb4/0xbc
> >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) )
> >> ---[ end trace 0000000000000000 ]-----
> >> Kernel panic - not syncing: Attempted to kill the idle task!k!
> >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]-----
> >>
> >>
> >> The kernel panic is related to the GIC distributor (currently under debug), but
> >> AFAIU this has nothing to do with the UART not working in the early stages.
> >
> >
> > Yes, I agree. The GIC issue and the UART (at least in polling mode) should be
> > independent.
> >
> > Best regards
> >
> > Dirk
> >
> >
> >> Thanks in advance for your advice,
> >> Cheers,
> >> Lior.
> >>
> >>
> >>> -----Original Message-----
> >>> From: Heiko Schocher <hs@denx.de>
> >>> Sent: Thursday, December 21, 2023 1:37 PM
> >>> To: Lior Weintraub <liorw@pliops.com>
> >>> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-
> embedded@vger.kernel.org
> >>> Subject: Re: Debugging early SError exception
> >>>
> >>> Hi Lior,
> >>>
> >>> On 21.12.23 12:19, Dirk Behme wrote:
> >>>> Am 21.12.23 um 11:04 schrieb Lior Weintraub:
> >>>>> Thanks Dirk,
> >>>>>
> >>>>> Regarding the earlyprintk, not sure I know how to make it work.
> >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on
> my
> >>> config but it doesn't seem to work.
> >>>>> Do I need to pass something in the bootargs from the U-BOOT?
> >>>>> Do I need to add that into my device tree?
> >>>>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under
> "chosen"
> >>> on my DT but it didn't
> >>>>> work)
> >>>>
> >>>> Yes, what has to be enabled, what not, and what has to be set how is often
> >>>> confusing. I think this is not common to all systems, so to be on the safe
> >>>> side you have to look into the code for your system. In short: the code is
> >>>> the documentation ;)
> >>>>
> >>>>
> >>>>> The UART I am using is "snps,dw-apb-uart".
> >>>>>
> >>>>> Last week, to output the early logs I have implemented this hack:
> >>>>> 1. Modify printk macro to run my print_func
> >>>>> 2. This print_func wrote the characters into a single global variable (u32
> >>> simul_uart;)
> >>>>> 3. Get the address location of this global variable and extract all writes to
> it
> >>> from the Tarmac
> >>>>> logs.
> >>>>>
> >>>>> This is a very slow and tedious process but it helped me identify the
> initial
> >>> SError.
> >>>>> Initially I thought I can write directly into the UART FIFO register (which I
> know
> >>> the address)
> >>>>> but this didn't work because Linux already setup the MMU so I guess I
> need to
> >>> know the virtual
> >>>>> address of this FIFO.
> >>>>> Do I need to use __phys_to_virt of some sort?
> >>>>
> >>>> Yes, I think so. Have a look at the existing serial driver, too. It should do
> >>>> what's needed, and you can borrow that, then.
> >>>
> >>> If you have access to the RAM after the crash (through a debugger or in
> >>> your bootloader) and your memory is stable, find the address of __log_buf
> >>> in System.map. That's the buffer printk writes into, so dumping its
> >>> content shows what you would see if the UART worked...
> >>>
> >>> Hope it helps!
> >>>
> >>> bye,
> >>> Heiko
> >>>>
> >>>> Best regards
> >>>>
> >>>> Dirk
> >>>>
> >>>>
> >>>>> Cheers,
> >>>>> Lior.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> >>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> embedded@vger.kernel.org
> >>>>>> Subject: Re: Debugging early SError exception
> >>>>>>
> >>>>>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> >>>>>>> Hi Dirk,
> >>>>>>>
> >>>>>>> We found that the issue was at the early stages of Barebox (a.k.a U-
> BOOT
> >>>>>> v2).
> >>>>>>
> >>>>>> Glad to hear that! :)
> >>>>>>
> >>>>>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
> >>>>>>> FIFO without checking if the FIFO is full.
> >>>>>>> Once the FIFO got full it caused this SError, probably because the UART IP
> >>>>>>> generated an apberror signal.
> >>>>>>
> >>>>>> Thanks for the report!
> >>>>>>
> >>>>>>> Now the Linux is running and doesn't report the SError again but now
> we
> >>>>>> face another issue.
> >>>>>>> We see that the PC is getting into a "report_bug" function.
> >>>>>>> The Linux doesn't print anything to the UART (probably since it hasn't
> got to
> >>>>>> the point where the console is configured?).
> >>>>>>
> >>>>>> For cases like this, using earlyprintk is usually a good option. Check
> >>>>>> whether the Linux kernel serial console (UART) driver of your SoC
> >>>>>> supports it. In the end it should be "just" a function in the serial
> >>>>>> console driver which outputs the console data via polling before
> >>>>>> (later) the interrupt driven console part takes over.
> >>>>>>
> >>>>>> Best regards
> >>>>>>
> >>>>>> Dirk
> >>>>>>
> >>>>>>
> >>>>>>> Since our debug means are limited it can take some time to find the
> root
> >>>>>> cause.
> >>>>>>>
> >>>>>>> I will keep you posted and update our findings.
> >>>>>>> Love to hear your thoughts,
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Lior.
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> >>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> embedded@vger.kernel.org
> >>>>>>>> Subject: Re: Debugging early SError exception
> >>>>>>>>
> >>>>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
> >>>>>>>>> Thanks Dirk,
> >>>>>>>>
> >>>>>>>> Welcome :)
> >>>>>>>>
> >>>>>>>> In case you find the root cause it would be nice to get some generic
> >>>>>>>> description of it so that we can learn something :)
> >>>>>>>>
> >>>>>>>> Best regards
> >>>>>>>>
> >>>>>>>> Dirk
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> >>>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-
> >>>>>> embedded@vger.kernel.org
> >>>>>>>>>> Subject: Re: Debugging early SError exception
> >>>>>>>>>>
> >>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
> >>>>>>>>>>> This SoC is ARM64 (A53) single core based device.
> >>>>>>>>>>> It runs correctly on QEMU but fails with SError on emulation
> platform
> >>>>>>>>>> (Synopsys Zebu running our SoC model).
> >>>>>>>>>>> There is no debugger connected to this emulation but there are
> several
> >>>>>>>>>> debug capabilities we can use:
> >>>>>>>>>>> 1. Generating wave dump of CPU signals
> >>>>>>>>>>> 2. Generate a Tarmac log
> >>>>>>>>>>> 3. UART
> >>>>>>>>>>>
> >>>>>>>>>>> Since the SError happens at early stages of Linux boot the UART
> is not
> >>>>>>>>>> enabled yet.
> >>>>>>>>>>>      From the Tarmac log we can see:
> >>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret (parse_early_param)
> >>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
> >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif,   x0      (setup_arch)
> >>>>>>>>>>>                          R CPSR 600000c5
> >>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
> >>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
> >>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
> >>>>>>>>>>>                          R CPSR 600003c5
> >>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
> >>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
> >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp,     #0x150  (vectors)
> >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp,     x0      (vectors)
> >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
> >>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
> >>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
> >>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> >>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
> >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp,     x0      (vectors)
> >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354        <el1h_64_error>         (vectors)
> >>>>>>>>>>>
> >>>>>>>>>>> If I understand correctly, the exception happened sometime earlier and
> >>>>>>>>>>> only now Linux boot code (setup_arch) opened the exception handling and as
> >>>>>>>>>>> a result we immediately jump to the SError exception handler.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> >>>>>>>>>> running something "quite new" on some software (QEMU) and hardware
> >>>>>>>>>> (Synopsys) simulators.
> >>>>>>>>>>
> >>>>>>>>>> That would mean that you have new hardware with e.g. a new memory map
> >>>>>>>>>> not used before. What you describe sounds like something in the code
> >>>>>>>>>> before Linux (boot loader) results in the SError. This might be an
> >>>>>>>>>> access to non-existing or non-enabled hardware, i.e. you might be
> >>>>>>>>>> trying to access (read/write) an address which is not available yet
> >>>>>>>>>> (or just invalid). That is hard to debug. In case you are able to
> >>>>>>>>>> modify the code before Linux (the boot loader?), you might try to
> >>>>>>>>>> enable SError exceptions there, too, to get the exception earlier and
> >>>>>>>>>> so make the search window smaller. I'm not that familiar with QEMU,
> >>>>>>>>>> but could you try to trace which (all?) hardware accesses your code
> >>>>>>>>>> does, and then check whether all these accesses are valid on the
> >>>>>>>>>> hardware (Synopsys) emulation system as well? That should be checked
> >>>>>>>>>> both for valid addresses and for hardware subsystem enablement.
> >>>>>>>>>>
> >>>>>>>>>> Hth,
> >>>>>>>>>>
> >>>>>>>>>> Dirk
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>      From the Linux source:
> >>>>>>>>>>>           parse_early_param();
> >>>>>>>>>>>
> >>>>>>>>>>>           dynamic_scs_init();
> >>>>>>>>>>>
> >>>>>>>>>>>           /*
> >>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
> >>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
> >>>>>>>>>>>            * occurred).
> >>>>>>>>>>>            */
> >>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
> >>>>>>>>>>>
> >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
> >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
> >>>>>>>>>>> lr : setup_arch+0x134/0x5ac
> >>>>>>>>>>> sp : ffff8000808f3da0
> >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
> >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
> >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
> >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
> >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
> >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
> >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
> >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
> >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
> >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
> >>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>>>>> Call trace:
> >>>>>>>>>>>       dump_backtrace+0x9c/0xd0
> >>>>>>>>>>>       show_stack+0x14/0x1c
> >>>>>>>>>>>       dump_stack_lvl+0x44/0x58
> >>>>>>>>>>>       dump_stack+0x14/0x1c
> >>>>>>>>>>>       panic+0x2e0/0x33c
> >>>>>>>>>>>       nmi_panic+0x68/0x6c
> >>>>>>>>>>>       arm64_serror_panic+0x68/0x78
> >>>>>>>>>>>       do_serror+0x24/0x54
> >>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
> >>>>>>>>>>>       el1h_64_error+0x64/0x68
> >>>>>>>>>>>       setup_arch+0x13c/0x5ac
> >>>>>>>>>>>       start_kernel+0x5c/0x5b8
> >>>>>>>>>>>       __primary_switched+0xb4/0xbc
> >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >>>>>>>>>>>
> >>>>>>>>>>> Can you please advise how to proceed with debugging?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks in advance,
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Lior.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>> --
> >>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> >>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de
> >
> 
> --
> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de


^ permalink raw reply

* Re: Debugging early SError exception
From: Heiko Schocher @ 2023-12-22  8:04 UTC (permalink / raw)
  To: Dirk Behme, Lior Weintraub; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <e9e8a7db-3ff9-44c6-aa00-2d42a1aafea5@gmail.com>

Hello Dirk, Lior,

On 22.12.23 08:48, Dirk Behme wrote:
> Am 22.12.23 um 08:03 schrieb Lior Weintraub:
>> Hi,
>>
>> I managed to dump the __log_buf but for some reason the UART is still not working.
>> Please note that UART printed all the U-BOOT traces so AFAIU, the device tree is set correctly.
>> (Barebox is passing its DTB to the kernel).
>>
>> To enable the earlyprintk I have:
>> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
>> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
>> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
>> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
>>
>>  From __log_buf dump:
>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4]
>> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot
>> 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 202323
>> Machine model: Pliops Spider MK-I EVKVK
>> efi: UEFI not found.d.
>> Zone ranges:s:
>>    DMA      [mem 0x0000000000000000-0x000000002fffffff]f]
>>    DMA32    emptyty
>>    Normal   emptyty
>> Movable zone start for each nodede
>> Early memory node rangeses
>>    node   0: [mem 0x0000000000000000-0x000000002fffffff]f]
>> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]f]
>> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000
>> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
>> pcpu-alloc: [0] 0
>> Detected VIPT I-cache on CPU0U0
>> CPU features: GIC system register CPU interface present but disabled by higher exception levelel
>> CPU features: detected: ARM erratum 84571919
>> alternatives: applying boot alternativeses
>> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd00030700000
>> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)r)
>> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r)
>> Built 1 zonelists, mobility grouping on.  Total pages: 19353636
>> mem auto-init: stack:off, heap alloc:off, heap free:offff
>> software IO TLB: area num 1.1.
>> software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)B)
>> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K
>> bss, 97192K reserved, 0K cma-reserved)d)
>> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1
>> trace event string verifier disableded
>> rcu: Hierarchical RCU implementation.n.
>> rcu:     RCU event tracing is enabled.d.
>> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1.
>> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s.
>> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1
>> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0
>> GICv3: 96 SPIs implementeded
>> GICv3: 0 Extended SPIs implementeded
>> Root IRQ handler: gic_handle_irqrq
>> GICv3: GICv3 features: 16 PPIsIs
>> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000
>> GICv3: redistributor failed to wakeup.....
>> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad
> 
> I think the two messages above are the essential ones.

+1

> Maybe it helps to check
> 
> https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> 
> In the middle of that page, the "Call the kernel image" section has something about the GIC:
> 
> -- cut --
> If the kernel is entered at EL1:
> 
>         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
>         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> -- cut --

It may also make sense to check your firmware (bootloader, ATF?) ... maybe
there is some setting missing for your SoC/board?

bye,
Heiko

> 
>> Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMPMP
>> Modules linked in:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
>> Hardware name: Pliops Spider MK-I EVK (DT)
>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : gic_cpu_sys_reg_init+0x58/0x2e4
>> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
>> sp : ffff8000808f3b40
>> x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
>> x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
>> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
>> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
>> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
>> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
>> x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
>> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
>> x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
>> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
>> Call trace:
>>   gic_cpu_sys_reg_init+0x58/0x2e4
>>   gic_cpu_init.part.0+0xa8/0x114
>>   gic_init_bases+0x408/0x684
>>   gic_of_init+0x298/0x300
>>   of_irq_init+0x1c8/0x368
>>   irqchip_init+0x14/0x1c
>>   init_IRQ+0x98/0xac
>>   start_kernel+0x250/0x5b8
>>   __primary_switched+0xb4/0xbc
>> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) )
>> ---[ end trace 0000000000000000 ]-----
>> Kernel panic - not syncing: Attempted to kill the idle task!k!
>> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]-----
>>
>>
>> The kernel panic is related to GIC distributor (currently under debug) but AFAIU,
>> this has nothing to do with the UART not working on early stages.
> 
> 
> Yes, I agree. The GIC issue and the UART (at least in polling mode) should be independent.
> 
> Best regards
> 
> Dirk
> 
> 
>> Thanks in advance for your advice,
>> Cheers,
>> Lior.
>>  
>>
>>> -----Original Message-----
>>> From: Heiko Schocher <hs@denx.de>
>>> Sent: Thursday, December 21, 2023 1:37 PM
>>> To: Lior Weintraub <liorw@pliops.com>
>>> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-embedded@vger.kernel.org
>>> Subject: Re: Debugging early SError exception
>>>
>>> Hi Lior,
>>>
>>> On 21.12.23 12:19, Dirk Behme wrote:
>>>> Am 21.12.23 um 11:04 schrieb Lior Weintraub:
>>>>> Thanks Dirk,
>>>>>
>>>>> Regarding the earlyprintk, not sure I know how to make it work.
>>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my config but it doesn't seem to work.
>>>>> Do I need to pass something in the bootargs from the U-BOOT?
>>>>> Do I need to add that into my device tree?
>>>>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" on my DT but it didn't
>>>>> work)
>>>>
>>>> Yes, what has to be enabled and what not, and what has to be set how, is often confusing. I think this
>>>> is not common for all systems, so I think to be on the safe side you have to look into the code for
>>>> your system. Or in short: the code is the documentation ;)
>>>>
>>>>
>>>>> The UART I am using is "snps,dw-apb-uart".
>>>>>
>>>>> Last week, to output the early logs I have implemented this hack:
>>>>> 1. Modify printk macro to run my print_func
>>>>> 2. This print_func wrote the characters into a single global variable (u32 simul_uart;)
>>>>> 3. Get the address location of this global variable and extract all writes to it from the Tarmac
>>>>> logs.
>>>>>
>>>>> This is a very slow and tedious process but it helped me identify the initial SError.
>>>>> Initially I thought I can write directly into the UART FIFO register (which I know the address)
>>>>> but this didn't work because Linux already setup the MMU so I guess I need to know the virtual
>>>>> address of this FIFO.
>>>>> Do I need to use __phys_to_virt of some sort?
>>>>
>>>> Yes, I think so. Have a look at the existing serial driver, too. It should do what's needed, and you
>>>> can borrow that, then.
>>>
>>> If you have access to the RAM after the crash (through a debugger or in
>>> your bootloader) and your mem is stable, find out the address of __log_buf
>>> in System.map. That's the buffer printk writes into, so dumping
>>> its content shows what you would see if the UART worked...
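
A minimal sketch of the lookup step, assuming the usual System.map layout of "<hex address> <type> <symbol>" per line. find_symbol_addr() is a made-up helper, not a kernel or bootloader API. The address it returns is a kernel virtual address, so a dump from a bootloader or debugger that works on physical addresses still needs the matching translation for the board.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a symbol in System.map-style text, where each line reads
 * "<hex address> <type> <name>". Returns 1 and stores the address on
 * success, 0 if the symbol is not listed.
 * find_symbol_addr() is a made-up helper for illustration.
 */
int find_symbol_addr(const char *map_text, const char *symbol, uint64_t *addr)
{
    const char *line = map_text;

    assert(map_text && symbol && addr);

    while (line && *line) {
        unsigned long long value;
        char type;
        char name[128];

        /* Exact name match, so "__log_buf_len" does not match "__log_buf". */
        if (sscanf(line, "%llx %c %127s", &value, &type, name) == 3 &&
            strcmp(name, symbol) == 0) {
            *addr = (uint64_t)value;
            return 1;
        }
        line = strchr(line, '\n');
        if (line)
            line++; /* continue with the next entry */
    }
    return 0;
}
```

With the address in hand, the buffer can be dumped as plain memory and read as text.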
>>>
>>> Hope it helps!
>>>
>>> bye,
>>> Heiko
>>>>
>>>> Best regards
>>>>
>>>> Dirk
>>>>
>>>>
>>>>> Cheers,
>>>>> Lior.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>> Sent: Thursday, December 21, 2023 10:30 AM
>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>> Subject: Re: Debugging early SError exception
>>>>>>
>>>>>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
>>>>>>> Hi Dirk,
>>>>>>>
>>>>>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT v2).
>>>>>>
>>>>>> Glad to hear that! :)
>>>>>>
>>>>>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx FIFO without checking if the FIFO is full.
>>>>>>> Once the fifo got full it caused this SError probably because the UART IP generated an apberror signal.
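
The fix described here can be sketched as follows: poll a status bit until the transmitter has room, and only then write the character. This is an illustration, not the actual Barebox code. The register offsets and the THRE bit assume a 16550-style layout with a 32-bit register stride, which should be checked against the UART databook; putc_ll_checked() and the spin limit are made up for the example.

```c
#include <stdint.h>

/* Assumed 16550-style layout with 32-bit register stride (not from the thread). */
#define UART_THR      0x00u /* transmit holding register */
#define UART_LSR      0x14u /* line status register */
#define UART_LSR_THRE 0x20u /* bit 5: transmitter can accept data */

static inline uint32_t uart_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static inline void uart_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/*
 * Hypothetical low-level putc: wait until the transmitter reports room,
 * then write the character. Returns 0 on success, -1 if the UART never
 * became ready within max_spins polls.
 */
int putc_ll_checked(volatile uint32_t *base, char c, unsigned long max_spins)
{
    while (!(uart_read(base, UART_LSR) & UART_LSR_THRE)) {
        if (max_spins-- == 0)
            return -1; /* give up instead of writing into a full FIFO */
    }
    uart_write(base, UART_THR, (uint8_t)c);
    return 0;
}
```

The bounded loop is a design choice for the sketch: an early debug path that can hang forever is hard to tell apart from the bug being chased.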
>>>>>>
>>>>>> Thanks for the report!
>>>>>>
>>>>>>> Now the Linux is running and doesn't report the SError again but now we face another issue.
>>>>>>> We see that the PC is getting into a "report_bug" function.
>>>>>>> The Linux doesn't print anything to the UART (probably since it hasn't got to the point where the console is configured?).
>>>>>>
>>>>>> For cases like this using earlyprintk is usually a good option. Check
>>>>>> the Linux kernel serial console (UART) driver of your SoC if it
>>>>>> supports it. In the end it should be "just" a function in the serial
>>>>>> console driver which outputs the console data via polling before
>>>>>> (later) the interrupt driven console part takes over.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Dirk
>>>>>>
>>>>>>
>>>>>>> Since our debug means are limited it can take some time to find the root cause.
>>>>>>>
>>>>>>> I will keep you posted and update our findings.
>>>>>>> Love to hear your thoughts,
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Lior.
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>>>> Subject: Re: Debugging early SError exception
>>>>>>>>
>>>>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
>>>>>>>>> Thanks Dirk,
>>>>>>>>
>>>>>>>> Welcome :)
>>>>>>>>
>>>>>>>> In case you find the root cause it would be nice to get some generic
>>>>>>>> description of it so that we can learn something :)
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> Dirk
>>>>>>>>
>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
>>>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>>>>>> Subject: Re: Debugging early SError exception
>>>>>>>>>>
>>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
>>>>>>>>>>> This SoC is ARM64 (A53) single core based device.
>>>>>>>>>>> It runs correctly on QEMU but fails with SError on emulation platform (Synopsys Zebu running our SoC model).
>>>>>>>>>>> There is no debugger connected to this emulation but there are several debug capabilities we can use:
>>>>>>>>>>> 1. Generating wave dump of CPU signals
>>>>>>>>>>> 2. Generate a Tarmac log
>>>>>>>>>>> 3. UART
>>>>>>>>>>>
>>>>>>>>>>> Since the SError happens at early stages of Linux boot the UART is not enabled yet.
>>>>>>>>>>>      From the Tarmac log we can see:
>>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret (parse_early_param)
>>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
>>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
>>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif,   x0      (setup_arch)
>>>>>>>>>>>                          R CPSR 600000c5
>>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
>>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
>>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
>>>>>>>>>>>                          R CPSR 600003c5
>>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
>>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
>>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp,     #0x150  (vectors)
>>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp,     x0      (vectors)
>>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
>>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
>>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
>>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
>>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
>>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp,     x0      (vectors)
>>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354        <el1h_64_error>         (vectors)
>>>>>>>>>>>
>>>>>>>>>>> If I understand correctly, the exception happened sometime earlier and
>>>>>>>>>>> only now Linux boot code (setup_arch) opened the exception handling and as
>>>>>>>>>>> a result we immediately jump to the SError exception handler.
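
To illustrate why the SError shows up exactly at this instruction: the mask bits live in PSTATE, the "msr daif, x0" with x0 = 0xc0 leaves only I and F set, and an SError that is already pending for the current exception level is taken once the A bit is clear. A small sketch that decodes the values from the trace (the helper names are made up; the bit positions are the architectural D, A, I, F positions):

```c
#include <stdint.h>

/* Exception mask bits as they appear in PSTATE/SPSR on AArch64. */
#define PSR_F_BIT 0x040u /* FIQ masked */
#define PSR_I_BIT 0x080u /* IRQ masked */
#define PSR_A_BIT 0x100u /* SError (asynchronous abort) masked */
#define PSR_D_BIT 0x200u /* debug exceptions masked */

/*
 * Render the mask bits the way the kernel's "pstate:" line does:
 * upper case means masked, lower case means unmasked.
 */
void daif_to_string(uint64_t pstate, char out[5])
{
    out[0] = (pstate & PSR_D_BIT) ? 'D' : 'd';
    out[1] = (pstate & PSR_A_BIT) ? 'A' : 'a';
    out[2] = (pstate & PSR_I_BIT) ? 'I' : 'i';
    out[3] = (pstate & PSR_F_BIT) ? 'F' : 'f';
    out[4] = '\0';
}

/* Returns 1 if SError is unmasked in the given PSTATE value. */
int serror_unmasked(uint64_t pstate)
{
    return (pstate & PSR_A_BIT) == 0;
}
```

For the values in the trace, 0x600000c5 (after the msr) renders as "daIF" and 0x600003c5 (on exception entry) as "DAIF".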
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
>>>>>>>>>> running something "quite new" on some software (QEMU) and hardware
>>>>>>>>>> (Synopsys) simulators.
>>>>>>>>>>
>>>>>>>>>> That would mean that you have new hardware with e.g. a new memory map
>>>>>>>>>> not used before. What you describe sounds like something in the code
>>>>>>>>>> before Linux (boot loader) results in the SError. This might be an
>>>>>>>>>> access to non-existing or non-enabled hardware, i.e. you might be
>>>>>>>>>> trying to access (read/write) an address which is not available yet
>>>>>>>>>> (or just invalid). That is hard to debug. In case you are able to
>>>>>>>>>> modify the code before Linux (the boot loader?), you might try to
>>>>>>>>>> enable SError exceptions there, too, to get the exception earlier and
>>>>>>>>>> so make the search window smaller. I'm not that familiar with QEMU,
>>>>>>>>>> but could you try to trace which (all?) hardware accesses your code
>>>>>>>>>> does, and then check whether all these accesses are valid on the
>>>>>>>>>> hardware (Synopsys) emulation system as well? That should be checked
>>>>>>>>>> both for valid addresses and for hardware subsystem enablement.
>>>>>>>>>>
>>>>>>>>>> Hth,
>>>>>>>>>>
>>>>>>>>>> Dirk
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>      From the Linux source:
>>>>>>>>>>>           parse_early_param();
>>>>>>>>>>>
>>>>>>>>>>>           dynamic_scs_init();
>>>>>>>>>>>
>>>>>>>>>>>           /*
>>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
>>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
>>>>>>>>>>>            * occurred).
>>>>>>>>>>>            */
>>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
>>>>>>>>>>>
>>>>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
>>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
>>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
>>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
>>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
>>>>>>>>>>> lr : setup_arch+0x134/0x5ac
>>>>>>>>>>> sp : ffff8000808f3da0
>>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
>>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
>>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
>>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
>>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
>>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
>>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
>>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
>>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
>>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
>>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>>>>> Call trace:
>>>>>>>>>>>       dump_backtrace+0x9c/0xd0
>>>>>>>>>>>       show_stack+0x14/0x1c
>>>>>>>>>>>       dump_stack_lvl+0x44/0x58
>>>>>>>>>>>       dump_stack+0x14/0x1c
>>>>>>>>>>>       panic+0x2e0/0x33c
>>>>>>>>>>>       nmi_panic+0x68/0x6c
>>>>>>>>>>>       arm64_serror_panic+0x68/0x78
>>>>>>>>>>>       do_serror+0x24/0x54
>>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
>>>>>>>>>>>       el1h_64_error+0x64/0x68
>>>>>>>>>>>       setup_arch+0x13c/0x5ac
>>>>>>>>>>>       start_kernel+0x5c/0x5b8
>>>>>>>>>>>       __primary_switched+0xb4/0xbc
>>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
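
The syndrome value in the log (0xbf000002) can be split into its architectural fields. A minimal sketch with made-up helper names; the field positions follow the ESR_ELx layout of the Arm architecture. When the IDS bit is set, the meaning of the remaining ISS bits is implementation defined, so that part has to come from the documentation of the core.

```c
#include <stdint.h>

#define ESR_EC_SERROR 0x2fu /* exception class: SError interrupt */

struct esr_fields {
    unsigned int ec;  /* bits [31:26], exception class */
    unsigned int il;  /* bit 25, instruction length */
    uint32_t iss;     /* bits [24:0], instruction specific syndrome */
    unsigned int ids; /* bit 24 of an SError ISS: implementation defined syndrome */
};

/* Split a 32-bit ESR_ELx value into its top-level fields. */
struct esr_fields decode_esr(uint32_t esr)
{
    struct esr_fields f;

    f.ec = (esr >> 26) & 0x3fu;
    f.il = (esr >> 25) & 0x1u;
    f.iss = esr & 0x01ffffffu;
    f.ids = (esr >> 24) & 0x1u;
    return f;
}
```

For 0xbf000002 this gives EC = 0x2f (SError interrupt), IL = 1 and IDS = 1, with 0x000002 left in the implementation defined bits.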
>>>>>>>>>>>
>>>>>>>>>>> Can you please advise how to proceed with debugging?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Lior.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>> -- 
>>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
>>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
>>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de
> 

-- 
DENX Software Engineering GmbH,      Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de

^ permalink raw reply

* Re: Debugging early SError exception
From: Dirk Behme @ 2023-12-22  7:48 UTC (permalink / raw)
  To: Lior Weintraub, hs@denx.de; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <PR3P195MB055563303A0E0E2BB3E07A98C394A@PR3P195MB0555.EURP195.PROD.OUTLOOK.COM>

Am 22.12.23 um 08:03 schrieb Lior Weintraub:
> Hi,
> 
> I managed to dump the __log_buf but for some reason the UART is still not working.
> Please note that UART printed all the U-BOOT traces so AFAIU, the device tree is set correctly.
> (Barebox is passing its DTB to the kernel).
> 
> To enable the earlyprintk I have:
> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon:
> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
> 
>  From __log_buf dump:
> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4]
> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 202323
> Machine model: Pliops Spider MK-I EVKVK
> efi: UEFI not found.d.
> Zone ranges:s:
>    DMA      [mem 0x0000000000000000-0x000000002fffffff]f]
>    DMA32    emptyty
>    Normal   emptyty
> Movable zone start for each nodede
> Early memory node rangeses
>    node   0: [mem 0x0000000000000000-0x000000002fffffff]f]
> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]f]
> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000
> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> pcpu-alloc: [0] 0
> Detected VIPT I-cache on CPU0U0
> CPU features: GIC system register CPU interface present but disabled by higher exception levelel
> CPU features: detected: ARM erratum 84571919
> alternatives: applying boot alternativeses
> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd00030700000
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)r)
> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r)
> Built 1 zonelists, mobility grouping on.  Total pages: 19353636
> mem auto-init: stack:off, heap alloc:off, heap free:offff
> software IO TLB: area num 1.1.
> software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)B)
> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)d)
> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1
> trace event string verifier disableded
> rcu: Hierarchical RCU implementation.n.
> rcu: 	RCU event tracing is enabled.d.
> rcu: 	RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1.
> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s.
> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1
> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0
> GICv3: 96 SPIs implementeded
> GICv3: 0 Extended SPIs implementeded
> Root IRQ handler: gic_handle_irqrq
> GICv3: GICv3 features: 16 PPIsIs
> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000
> GICv3: redistributor failed to wakeup.....
> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad

I think the two messages above are the essential ones.

Maybe it helps to check

https://www.kernel.org/doc/html/v5.3/arm64/booting.html

In the middle of that page, the "Call the kernel image" section has
something about the GIC:

-- cut --
If the kernel is entered at EL1:

         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
-- cut --
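
As a sketch of what that requirement means in terms of register bits: the actual write is an msr to ICC_SRE_EL2 done by whatever runs at EL2 or EL3 before the kernel, which is not shown here, and the helper names are made up.

```c
#include <stdint.h>

#define ICC_SRE_EL2_SRE    (1u << 0) /* system register interface enabled */
#define ICC_SRE_EL2_ENABLE (1u << 3) /* lower ELs may access ICC_SRE_EL1 */

/* Value firmware would need to leave in ICC_SRE_EL2 before an EL1 entry. */
uint32_t icc_sre_el2_for_el1_entry(uint32_t current)
{
    return current | ICC_SRE_EL2_SRE | ICC_SRE_EL2_ENABLE;
}

/* Returns 1 if both bits required by the booting document are set. */
int icc_sre_el2_allows_el1_sre(uint32_t value)
{
    const uint32_t required = ICC_SRE_EL2_SRE | ICC_SRE_EL2_ENABLE;

    return (value & required) == required;
}
```

Starting from a register value of 0, the two required bits give 0x9.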

> Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMPMP
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> Hardware name: Pliops Spider MK-I EVK (DT)
> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> sp : ffff8000808f3b40
> x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
> x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
> Call trace:
>   gic_cpu_sys_reg_init+0x58/0x2e4
>   gic_cpu_init.part.0+0xa8/0x114
>   gic_init_bases+0x408/0x684
>   gic_of_init+0x298/0x300
>   of_irq_init+0x1c8/0x368
>   irqchip_init+0x14/0x1c
>   init_IRQ+0x98/0xac
>   start_kernel+0x250/0x5b8
>   __primary_switched+0xb4/0xbc
> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) )
> ---[ end trace 0000000000000000 ]-----
> Kernel panic - not syncing: Attempted to kill the idle task!k!
> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]-----
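
The instruction in parentheses on the "Code:" line is the one that faulted. A rough sketch of decoding it, assuming the standard A64 MRS encoding; decode_mrs() is a made-up helper:

```c
#include <stdint.h>

/* Fields of an A64 system register read (MRS Xt, <sysreg>). */
struct sysreg_access {
    unsigned int op0, op1, crn, crm, op2;
    unsigned int rt; /* destination register number */
};

/*
 * Decode an MRS instruction. Returns 1 and fills *out if insn is an MRS,
 * 0 otherwise.
 */
int decode_mrs(uint32_t insn, struct sysreg_access *out)
{
    /* Bits [31:20] are 0xd53 for MRS (0xd51 would be MSR, register form). */
    if ((insn & 0xfff00000u) != 0xd5300000u)
        return 0;

    out->op0 = 2u + ((insn >> 19) & 0x1u);
    out->op1 = (insn >> 16) & 0x7u;
    out->crn = (insn >> 12) & 0xfu;
    out->crm = (insn >> 8) & 0xfu;
    out->op2 = (insn >> 5) & 0x7u;
    out->rt = insn & 0x1fu;
    return 1;
}
```

For 0xd538cc80 this yields S3_0_C12_C12_4 with Rt = x0, which should be the encoding of ICC_CTLR_EL1, i.e. a read of a GIC CPU interface system register. That would fit the "unable to set SRE" message just before the oops.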
> 
> 
> The kernel panic is related to the GIC distributor (currently under debug), but AFAIU
> this has nothing to do with the UART not working in the early stages.


Yes, I agree. The GIC issue and the UART (at least in polling mode) should be
independent.
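
To illustrate what polling mode means here, a minimal sketch. It assumes
a 16550-compatible register layout (LSR at index 5, THRE = 0x20) with
32-bit wide registers; the function name and the bounded spin count are
illustrative and not taken from any driver:

```c
#include <stdint.h>

/* 16550-style register indices, assuming 32-bit wide register slots. */
#define UART_THR       0      /* transmit holding register (write) */
#define UART_LSR       5      /* line status register */
#define UART_LSR_THRE  0x20u  /* transmit holding register empty */

/*
 * Hypothetical polled putc: wait (bounded) until the transmitter can
 * take a byte, then write it. No interrupts are involved.
 * Returns 0 on success, -1 if the transmitter never became ready.
 */
static int uart_putc_polled(volatile uint32_t *regs, char c,
			    unsigned int max_spins)
{
	while (!(regs[UART_LSR] & UART_LSR_THRE)) {
		if (max_spins-- == 0)
			return -1;
	}
	regs[UART_THR] = (uint8_t)c;
	return 0;
}
```

Waiting for THRE before each write is also what avoids overrunning the
Tx FIFO, the problem found earlier in the boot loader's putc_ll.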

Best regards

Dirk


> Thanks in advance for your advice,
> Cheers,
> Lior.
>   
> 
> 
>> -----Original Message-----
>> From: Heiko Schocher <hs@denx.de>
>> Sent: Thursday, December 21, 2023 1:37 PM
>> To: Lior Weintraub <liorw@pliops.com>
>> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-embedded@vger.kernel.org
>> Subject: Re: Debugging early SError exception
>>
>> Hi Lior,
>>
>> On 21.12.23 12:19, Dirk Behme wrote:
>>> Am 21.12.23 um 11:04 schrieb Lior Weintraub:
>>>> Thanks Dirk,
>>>>
>>>> Regarding earlyprintk, I am not sure how to make it work.
>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y in my config,
>>>> but it doesn't seem to work.
>>>> Do I need to pass something in the bootargs from U-BOOT?
>>>> Do I need to add that to my device tree?
>>>> (I tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen"
>>>> in my DT, but it didn't work.)
>>>
>>> Yes, what has to be enabled, what not, and how it has to be set is often
>>> confusing. I think this is not common across all systems, so to be on the
>>> safe side you have to look into the code for your system. In short: the
>>> code is the documentation ;)
>>>
>>>
>>>> The UART I am using is "snps,dw-apb-uart".
>>>>
>>>> Last week, to output the early logs, I implemented this hack:
>>>> 1. Modify the printk macro to run my print_func.
>>>> 2. This print_func writes the characters into a single global variable
>>>>    (u32 simul_uart;).
>>>> 3. Get the address of this global variable and extract all writes to it
>>>>    from the Tarmac logs.
>>>>
>>>> This is a very slow and tedious process, but it helped me identify the
>>>> initial SError.
>>>> Initially I thought I could write directly into the UART FIFO register
>>>> (whose address I know), but this didn't work because Linux has already
>>>> set up the MMU, so I guess I need to know the virtual address of this FIFO.
>>>> Do I need to use __phys_to_virt of some sort?
>>>
>>> Yes, I think so. Have a look at the existing serial driver, too. It should
>>> do what's needed, and you can borrow that, then.
>>
>> If you have access to the RAM after the crash (through a debugger or in
>> your bootloader) and your memory is stable, find the address of __log_buf
>> in System.map. That's the buffer printk writes into, so dumping its
>> content shows what you would see if the UART worked...
>>
>> Hope it helps!
>>
>> bye,
>> Heiko
>>>
>>> Best regards
>>>
>>> Dirk
>>>
>>>
>>>> Cheers,
>>>> Lior.
>>>>
>>>>> -----Original Message-----
>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>> Sent: Thursday, December 21, 2023 10:30 AM
>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>> Subject: Re: Debugging early SError exception
>>>>>
>>>>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
>>>>>> Hi Dirk,
>>>>>>
>>>>>> We found that the issue was at the early stages of Barebox (a.k.a. U-BOOT v2).
>>>>>
>>>>> Glad to hear that! :)
>>>>>
>>>>>> Our implementation of putc_ll (in debug_ll) was writing into the UART Tx
>>>>>> FIFO without checking whether the FIFO was full.
>>>>>> Once the FIFO got full, it caused this SError, probably because the UART IP
>>>>>> generated an apberror signal.
>>>>>
>>>>> Thanks for the report!
>>>>>
>>>>>> Now Linux is running and doesn't report the SError any more, but we
>>>>>> face another issue.
>>>>>> We see that the PC is getting into a "report_bug" function.
>>>>>> Linux doesn't print anything to the UART (probably since it hasn't got to
>>>>>> the point where the console is configured?).
>>>>>
>>>>> For cases like this, using earlyprintk is usually a good option. Check
>>>>> whether the Linux kernel serial console (UART) driver of your SoC
>>>>> supports it. In the end it should be "just" a function in the serial
>>>>> console driver which outputs the console data via polling before
>>>>> (later) the interrupt-driven console part takes over.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Dirk
>>>>>
>>>>>
>>>>>> Since our debug means are limited, it can take some time to find the
>>>>>> root cause.
>>>>>>
>>>>>> I will keep you posted and update our findings.
>>>>>> Love to hear your thoughts,
>>>>>>
>>>>>> Cheers,
>>>>>> Lior.
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>>> Subject: Re: Debugging early SError exception
>>>>>>>
>>>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
>>>>>>>> Thanks Dirk,
>>>>>>>
>>>>>>> Welcome :)
>>>>>>>
>>>>>>> In case you find the root cause it would be nice to get some generic
>>>>>>> description of it so that we can learn something :)
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Dirk
>>>>>>>
>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
>>>>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
>>>>>>>>> Subject: Re: Debugging early SError exception
>>>>>>>>>
>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
>>>>>>>>>> This SoC is an ARM64 (A53) single-core device.
>>>>>>>>>> It runs correctly on QEMU but fails with an SError on the emulation
>>>>>>>>>> platform (Synopsys Zebu running our SoC model).
>>>>>>>>>> There is no debugger connected to this emulation, but there are several
>>>>>>>>>> debug capabilities we can use:
>>>>>>>>>> 1. Generate a wave dump of CPU signals
>>>>>>>>>> 2. Generate a Tarmac log
>>>>>>>>>> 3. UART
>>>>>>>>>>
>>>>>>>>>> Since the SError happens at an early stage of the Linux boot, the UART
>>>>>>>>>> is not enabled yet.
>>>>>>>>>>      From the Tarmac log we can see:
>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret (parse_early_param)
>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif,   x0      (setup_arch)
>>>>>>>>>>                          R CPSR 600000c5
>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
>>>>>>>>>>                          R CPSR 600003c5
>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp,     #0x150  (vectors)
>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp,     x0      (vectors)
>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp,     x0      (vectors)
>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp,     x0      (vectors)
>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354        <el1h_64_error>         (vectors)
>>>>>>>>>>
>>>>>>>>>> If I understand correctly, the exception happened sometime earlier, and
>>>>>>>>>> only now did the Linux boot code (setup_arch) unmask the exception; as a
>>>>>>>>>> result we immediately jump to the SError exception handler.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
>>>>>>>>> running something "quite new" on software (QEMU) and hardware
>>>>>>>>> (Synopsys) simulators.
>>>>>>>>>
>>>>>>>>> That would mean that you have new hardware with, e.g., a new memory
>>>>>>>>> map not used before. What you describe sounds like something in the
>>>>>>>>> code before Linux (the boot loader) is causing the SError. This
>>>>>>>>> might be an access to non-existing or non-enabled hardware, i.e. you
>>>>>>>>> might be trying to access (read/write) an address which is not
>>>>>>>>> available yet (or is just invalid). It's hard to debug that. In case you
>>>>>>>>> are able to modify the code before Linux (the boot loader?), you might
>>>>>>>>> try to enable SError exceptions there, too, to get it earlier and
>>>>>>>>> with that make the search window smaller. I'm not that familiar with
>>>>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
>>>>>>>>> code does, then analyse them and check whether all these accesses are
>>>>>>>>> also valid on the hardware (Synopsys) emulation system? That should
>>>>>>>>> be checked both for valid addresses and for hardware subsystem
>>>>>>>>> enablement.
>>>>>>>>>
>>>>>>>>> Hth,
>>>>>>>>>
>>>>>>>>> Dirk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>      From the Linux source:
>>>>>>>>>>           parse_early_param();
>>>>>>>>>>
>>>>>>>>>>           dynamic_scs_init();
>>>>>>>>>>
>>>>>>>>>>           /*
>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
>>>>>>>>>>            * occurred).
>>>>>>>>>>            */
>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
>>>>>>>>>>
>>>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
>>>>>>>>>> lr : setup_arch+0x134/0x5ac
>>>>>>>>>> sp : ffff8000808f3da0
>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
>>>>>>>>>> Call trace:
>>>>>>>>>>       dump_backtrace+0x9c/0xd0
>>>>>>>>>>       show_stack+0x14/0x1c
>>>>>>>>>>       dump_stack_lvl+0x44/0x58
>>>>>>>>>>       dump_stack+0x14/0x1c
>>>>>>>>>>       panic+0x2e0/0x33c
>>>>>>>>>>       nmi_panic+0x68/0x6c
>>>>>>>>>>       arm64_serror_panic+0x68/0x78
>>>>>>>>>>       do_serror+0x24/0x54
>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
>>>>>>>>>>       el1h_64_error+0x64/0x68
>>>>>>>>>>       setup_arch+0x13c/0x5ac
>>>>>>>>>>       start_kernel+0x5c/0x5b8
>>>>>>>>>>       __primary_switched+0xb4/0xbc
>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>>>>>>>>>>
>>>>>>>>>> Can you please advise how to proceed with debugging?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Cheers,
>>>>>>>>>> Lior.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>
>> --
>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de


^ permalink raw reply

* RE: Debugging early SError exception
From: Lior Weintraub @ 2023-12-22  7:03 UTC (permalink / raw)
  To: hs@denx.de, Dirk Behme; +Cc: linux-embedded@vger.kernel.org
In-Reply-To: <a288e1c4-8637-34bc-b6a3-c9aa3edb22e6@denx.de>

Hi,

I managed to dump __log_buf, but for some reason the UART is still not working.
Please note that the UART printed all the U-BOOT traces, so AFAIU the device tree is set up correctly
(Barebox passes its DTB to the kernel).
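
For reference, the System.map lookup that precedes such a dump can be
sketched as below. This is only an illustration: the function name is
made up, and it works on the map text already loaded into memory. The
address it returns is a kernel virtual address, so it still has to be
translated to a physical address before reading RAM from the boot loader.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up `symbol` in System.map-style text, one
 * "<hex address> <type> <name>" entry per line.
 * Returns 0 and stores the address on success, -1 if not found.
 */
static int system_map_lookup(const char *map_text, const char *symbol,
			     uint64_t *addr)
{
	const char *line = map_text;

	while (line && *line) {
		char name[128];
		char type;
		unsigned long long value;

		if (sscanf(line, "%llx %c %127s", &value, &type, name) == 3 &&
		    strcmp(name, symbol) == 0) {
			*addr = (uint64_t)value;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```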

To enable earlyprintk I have:
1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y.
2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000".
3. Verified that the dw-apb-uart driver (8250_early.c) supports earlycon:
   OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
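
For comparison, the kernel-parameters documentation lists the 8250
earlycon forms as "earlycon=uart[8250],<iotype>,<addr>[,options]" (with
iotype io, mmio, mmio16, mmio32, ...), or a bare "earlycon" that takes
the console from stdout-path in the DT. The user-space sketch below is
not the kernel's parser; the struct and function names are made up for
illustration. It splits such a value into its parts and rejects the
"dw-apb-uart,<addr>" form from step 2, which may be one reason why no
earlycon output appears:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct earlycon_opts {
	char name[16];    /* "uart" or "uart8250" */
	char iotype[16];  /* e.g. "mmio32" */
	uint64_t addr;    /* physical base address of the UART */
};

/*
 * Illustrative parser for the value part of
 * "earlycon=<name>,<iotype>,<addr>[,options]"; options are ignored.
 * Returns 0 on success, -1 on a malformed or unknown string.
 */
static int parse_earlycon(const char *value, struct earlycon_opts *out)
{
	char addr_text[32];
	char *end;

	if (sscanf(value, "%15[^,],%15[^,],%31[^,]",
		   out->name, out->iotype, addr_text) != 3)
		return -1;
	if (strcmp(out->name, "uart") != 0 &&
	    strcmp(out->name, "uart8250") != 0)
		return -1;

	/* Base 0 accepts the usual 0x prefix; the address may exceed 32 bits. */
	out->addr = strtoull(addr_text, &end, 0);
	if (end == addr_text || *end != '\0')
		return -1;
	return 0;
}
```

With this sketch, "uart8250,mmio32,0xd000307000" parses, while
"dw-apb-uart,0xd000307000" does not.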

From __log_buf dump:
Booting Linux on physical CPU 0x0000000000 [0x410fd034]
Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 2023
Machine model: Pliops Spider MK-I EVK
efi: UEFI not found.
Zone ranges:
  DMA      [mem 0x0000000000000000-0x000000002fffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000000002fffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
pcpu-alloc: [0] 0 
Detected VIPT I-cache on CPU0
CPU features: GIC system register CPU interface present but disabled by higher exception level
CPU features: detected: ARM erratum 845719
alternatives: applying boot alternatives
Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
Built 1 zonelists, mobility grouping on.  Total pages: 193536
mem auto-init: stack:off, heap alloc:off, heap free:off
software IO TLB: area num 1.
software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)
Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
trace event string verifier disabled
rcu: Hierarchical RCU implementation.
rcu: 	RCU event tracing is enabled.
rcu: 	RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
GICv3: 96 SPIs implemented
GICv3: 0 Extended SPIs implemented
Root IRQ handler: gic_handle_irq
GICv3: GICv3 features: 16 PPIs
GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
GICv3: redistributor failed to wakeup...
GICv3: GIC: unable to set SRE (disabled at EL2), panic ahead
Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
Hardware name: Pliops Spider MK-I EVK (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : gic_cpu_sys_reg_init+0x58/0x2e4
lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
sp : ffff8000808f3b40
x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
Call trace:
 gic_cpu_sys_reg_init+0x58/0x2e4
 gic_cpu_init.part.0+0xa8/0x114
 gic_init_bases+0x408/0x684
 gic_of_init+0x298/0x300
 of_irq_init+0x1c8/0x368
 irqchip_init+0x14/0x1c
 init_IRQ+0x98/0xac
 start_kernel+0x250/0x5b8
 __primary_switched+0xb4/0xbc
Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---


The kernel panic is related to the GIC distributor (currently under debug), but AFAIU
this has nothing to do with the UART not working in the early stages.
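
As a side note on the oops itself: the faulting word in the "Code:" line
can be decoded mechanically. The sketch below (struct and function names
are made up for illustration) extracts the system register fields from an
AArch64 MRS/MSR instruction word. For d538cc80 it yields a read of
S3_0_C12_C12_4, which as far as I can tell is ICC_CTLR_EL1, and for the
earlier d538cca0 a read of S3_0_C12_C12_5 (ICC_SRE_EL1):

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of an AArch64 MRS/MSR (register) instruction. */
struct sysreg_access {
	bool is_read;                         /* true: MRS, false: MSR */
	unsigned int op0, op1, crn, crm, op2; /* system register encoding */
	unsigned int rt;                      /* general-purpose register Xt */
};

/*
 * Decode one instruction word. Returns false if the word is not a
 * system register move (bits [31:22] = 0b1101010100 and bit 20 = 1).
 */
static bool decode_sysreg_access(uint32_t insn, struct sysreg_access *out)
{
	if ((insn & 0xffd00000u) != 0xd5100000u)
		return false;

	out->is_read = (insn >> 21) & 1u;
	out->op0 = (insn >> 19) & 0x3u;
	out->op1 = (insn >> 16) & 0x7u;
	out->crn = (insn >> 12) & 0xfu;
	out->crm = (insn >> 8) & 0xfu;
	out->op2 = (insn >> 5) & 0x7u;
	out->rt = insn & 0x1fu;
	return true;
}
```

An undefined instruction on a GIC CPU interface register read would be
consistent with the "unable to set SRE (disabled at EL2)" message in the
log, i.e. with the system register interface not being usable from EL1.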

Thanks in advance for your advice,
Cheers,
Lior.
 


> -----Original Message-----
> From: Heiko Schocher <hs@denx.de>
> Sent: Thursday, December 21, 2023 1:37 PM
> To: Lior Weintraub <liorw@pliops.com>
> Cc: Dirk Behme <dirk.behme@gmail.com>; linux-embedded@vger.kernel.org
> Subject: Re: Debugging early SError exception
> 
> Hi Lior,
> 
> On 21.12.23 12:19, Dirk Behme wrote:
> > Am 21.12.23 um 11:04 schrieb Lior Weintraub:
> >> Thanks Dirk,
> >>
> >> Regarding the earlyprintk, not sure I know how to make it work.
> >> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my
> config but it doesn't seem to work.
> >> Do I need to pass something in the bootargs from the U-BOOT?
> >> Do I need to add that into my device tree?
> >> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen"
> on my DT but it didn't
> >> work)
> >
> > Yes, what has to be enabled, what not, and how it has to be set is often
> > confusing. I think this is not common across all systems, so to be on the
> > safe side you have to look into the code for your system. In short: the
> > code is the documentation ;)
> >
> >
> >> The UART I am using is "snps,dw-apb-uart".
> >>
> >> Last week, to output the early logs I have implemented this hack:
> >> 1. Modify printk macro to run my print_func
> >> 2. This print_func wrote the characters into a single global variable (u32
> simul_uart;)
> >> 3. Get the address location of this global variable and extract all writes to it
> from the Tarmac
> >> logs.
> >>
> >> This is a very slow and tedious process but it helped me identify the initial
> SError.
> >> Initially I thought I can write directly into the UART FIFO register (which I know
> the address)
> >> but this didn't work because Linux already setup the MMU so I guess I need to
> know the virtual
> >> address of this FIFO.
> >> Do I need to use __phys_to_virt of some sort?
> >
> > Yes, I think so. Have a look at the existing serial driver, too. It should
> > do what's needed, and you can borrow that, then.
> 
> If you have access to the RAM after the crash (through a debugger or in
> your bootloader) and your memory is stable, find the address of __log_buf
> in System.map. That's the buffer printk writes into, so dumping its
> content shows what you would see if the UART worked...
> 
> Hope it helps!
> 
> bye,
> Heiko
> >
> > Best regards
> >
> > Dirk
> >
> >
> >> Cheers,
> >> Lior.
> >>
> >>> -----Original Message-----
> >>> From: Dirk Behme <dirk.behme@gmail.com>
> >>> Sent: Thursday, December 21, 2023 10:30 AM
> >>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>> Subject: Re: Debugging early SError exception
> >>>
> >>> Am 21.12.23 um 08:43 schrieb Lior Weintraub:
> >>>> Hi Dirk,
> >>>>
> >>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT
> >>> v2).
> >>>
> >>> Glad to hear that! :)
> >>>
> >>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx
> >>> FIFO without checking if the FIFO is full.
> >>>> Once the fifo got full it caused this SError probably because the UART IP
> >>> generated an apberror signal.
> >>>
> >>> Thanks for the report!
> >>>
> >>>> Now the Linux is running and doesn't report the SError again but now we
> >>> face another issue.
> >>>> We see that the PC is getting into a "report_bug" function.
> >>>> The Linux doesn't print anything to the UART (probably since it hasn't got to
> >>> the point where the console is configured?).
> >>>
> >>> For cases like this, using earlyprintk is usually a good option. Check
> >>> whether the Linux kernel serial console (UART) driver of your SoC
> >>> supports it. In the end it should be "just" a function in the serial
> >>> console driver which outputs the console data via polling before
> >>> (later) the interrupt-driven console part takes over.
> >>>
> >>> Best regards
> >>>
> >>> Dirk
> >>>
> >>>
> >>>> Since our debug means are limited it can take some time to find the root
> >>> cause.
> >>>>
> >>>> I will keep you posted and update our findings.
> >>>> Love to hear your thoughts,
> >>>>
> >>>> Cheers,
> >>>> Lior.
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> >>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>>>> Subject: Re: Debugging early SError exception
> >>>>>
> >>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub:
> >>>>>> Thanks Dirk,
> >>>>>
> >>>>> Welcome :)
> >>>>>
> >>>>> In case you find the root cause it would be nice to get some generic
> >>>>> description of it so that we can learn something :)
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Dirk
> >>>>>
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Dirk Behme <dirk.behme@gmail.com>
> >>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> >>>>>>> To: Lior Weintraub <liorw@pliops.com>; linux-embedded@vger.kernel.org
> >>>>>>> Subject: Re: Debugging early SError exception
> >>>>>>>
> >>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We have a new SoC with eLinux porting (kernel v6.5).
> >>>>>>>> This SoC is ARM64 (A53) single core based device.
> >>>>>>>> It runs correctly on QEMU but fails with SError on emulation platform
> >>>>>>> (Synopsys Zebu running our SoC model).
> >>>>>>>> There is no debugger connected to this emulation but there are several
> >>>>>>> debug capabilities we can use:
> >>>>>>>> 1. Generating wave dump of CPU signals
> >>>>>>>> 2. Generate a Tarmac log
> >>>>>>>> 3. UART
> >>>>>>>>
> >>>>>>>> Since the SError happens at early stages of Linux boot the UART is not
> >>>>>>> enabled yet.
> >>>>>>>>     From the Tarmac log we can see:
> >>>>>>>>      3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret
> >>>>>>> (parse_early_param)
> >>>>>>>>      3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov
> >>>>> x0,
> >>>>>>> #0xc0   //      #192    (setup_arch)
> >>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>      3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr
> >>>>>>> daif,   x0      (setup_arch)
> >>>>>>>>                         R CPSR 600000c5
> >>>>>>>>      3824884529 ps  ES  System Error (Abort)
> >>>>>>>>                         EXC [0x380] SError/vSError Current EL with SP_ELx
> >>>>>>>>                         R ESR_EL1 (AARCH64) bf000002
> >>>>>>>>                         R CPSR 600003c5
> >>>>>>>>                         R SPSR_EL1 (AARCH64) 600000c5
> >>>>>>>>                         R ELR_EL1 (AARCH64) ffff8000 80763a68
> >>>>>>>>      3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub
> >>>>> sp,
> >>>>>>> sp,     #0x150  (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add
> >>>>> sp,
> >>>>>>> sp,     x0      (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3d10
> >>>>>>>>      3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub
> >>>>> x0,
> >>>>>>> sp,     x0      (vectors)
> >>>>>>>>                         R X0 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz
> >>>>> w0,
> >>>>>>> #14,    ffff800080010b9c        <vectors+0x39c>         (vectors)
> >>>>>>>>      3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub
> >>>>> x0,
> >>>>>>> sp,     x0      (vectors)
> >>>>>>>>                         R X0 (AARCH64) 00000000 000000c0
> >>>>>>>>      3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub
> >>> sp,
> >>>>>>> sp,     x0      (vectors)
> >>>>>>>>                         R SP_EL1 (AARCH64) ffff8000 808f3c50
> >>>>>>>>      3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b
> >>>>>>> ffff800080011354        <el1h_64_error>         (vectors)
> >>>>>>>>
> >>>>>>>> If I understand correctly, the exception happened sometime earlier
> and
> >>>>> only
> >>>>>>> now Linux boot code (setup_arch) opened the exception handling and as
> >>> a
> >>>>>>> result we immediately jump to the SError exception handler.
> >>>>>>>
> >>>>>>>
> >>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> >>>>>>> running something "quite new" on software (QEMU) and hardware
> >>>>>>> (Synopsys) simulators.
> >>>>>>>
> >>>>>>> That would mean that you have new hardware with, e.g., a new memory
> >>>>>>> map not used before. What you describe sounds like something in the
> >>>>>>> code before Linux (the boot loader) is causing the SError. This
> >>>>>>> might be an access to non-existing or non-enabled hardware, i.e. you
> >>>>>>> might be trying to access (read/write) an address which is not
> >>>>>>> available yet (or is just invalid). It's hard to debug that. In case you
> >>>>>>> are able to modify the code before Linux (the boot loader?), you might
> >>>>>>> try to enable SError exceptions there, too, to get it earlier and
> >>>>>>> with that make the search window smaller. I'm not that familiar with
> >>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
> >>>>>>> code does, then analyse them and check whether all these accesses are
> >>>>>>> also valid on the hardware (Synopsys) emulation system? That should
> >>>>>>> be checked both for valid addresses and for hardware subsystem
> >>>>>>> enablement.
> >>>>>>>
> >>>>>>> Hth,
> >>>>>>>
> >>>>>>> Dirk
> >>>>>>>
> >>>>>>>
> >>>>>>>>     From the Linux source:
> >>>>>>>>          parse_early_param();
> >>>>>>>>
> >>>>>>>>          dynamic_scs_init();
> >>>>>>>>
> >>>>>>>>          /*
> >>>>>>>>           * Unmask asynchronous aborts and fiq after bringing up possible
> >>>>>>>>           * earlycon. (Report possible System Errors once we can report
> this
> >>>>>>>>           * occurred).
> >>>>>>>>           */
> >>>>>>>>          local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we
> >>> get
> >>>>> the
> >>>>>>> exception.
> >>>>>>>>
> >>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
> >>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> >>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-
> >>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld
> >>> (GNU
> >>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> >>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> >>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>>>>> pc : setup_arch+0x13c/0x5ac
> >>>>>>>> lr : setup_arch+0x134/0x5ac
> >>>>>>>> sp : ffff8000808f3da0
> >>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
> >>>>>>> 0000000005e31b58c
> >>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
> >>>>>>> ffff8000808f8000c
> >>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
> >>>>> ffff800080010000c
> >>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
> >>> 000000002266684ac
> >>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
> >>>>>>> 0000000000000008c
> >>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
> >>> 0000000000000003c
> >>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
> >>>>> 0000000000000038c
> >>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
> >>>>> 0000000000000001c
> >>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
> >>>>>>> 0000000000000065c
> >>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
> >>>>>>> 00000000000000c0c
> >>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> >>>>>>>> Call trace:
> >>>>>>>>      dump_backtrace+0x9c/0xd0
> >>>>>>>>      show_stack+0x14/0x1c
> >>>>>>>>      dump_stack_lvl+0x44/0x58
> >>>>>>>>      dump_stack+0x14/0x1c
> >>>>>>>>      panic+0x2e0/0x33c
> >>>>>>>>      nmi_panic+0x68/0x6c
> >>>>>>>>      arm64_serror_panic+0x68/0x78
> >>>>>>>>      do_serror+0x24/0x54
> >>>>>>>>      el1h_64_error_handler+0x2c/0x40
> >>>>>>>>      el1h_64_error+0x64/0x68
> >>>>>>>>      setup_arch+0x13c/0x5ac
> >>>>>>>>      start_kernel+0x5c/0x5b8
> >>>>>>>>      __primary_switched+0xb4/0xbc
> >>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >>>>>>>>
> >>>>>>>> Can you please advise how to proceed with debugging?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>> Cheers,
> >>>>>>>> Lior.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
> 
> --
> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@denx.de

^ permalink raw reply

