linux-amlogic.lists.infradead.org archive mirror
From: Stefan Agner <stefan@agner.ch>
To: Robin Murphy <robin.murphy@arm.com>, Byron Stanoszek <gandalf@winds.org>
Cc: linux-amlogic@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org,
	Neil Armstrong <narmstrong@baylibre.com>,
	Jerome Brunet <jbrunet@baylibre.com>,
	Kevin Hilman <khilman@baylibre.com>,
	Martin Blumenstingl <martin.blumenstingl@googlemail.com>,
	Mike Rapoport <rppt@kernel.org>
Subject: Re: Random reboots on ODROID-N2+
Date: Fri, 23 Jul 2021 17:56:01 +0200
Message-ID: <d1fefb1ec2e7ed1b40083426ed7102c6@agner.ch>
In-Reply-To: <c2a16fbc-653f-5993-f9bb-9ee8c707515e@arm.com>

Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
> On 2021-07-23 15:25, Byron Stanoszek wrote:
>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>
>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>> Hi,
>>>>
>>>> We are currently testing a new release using Linux 5.10.33. Since then I've
>>>> received several reports of random reboots every couple of days.
>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>> at some point.
>>>>
>>>> After running serial console on several instances, I was able to catch
>>>> this stack trace:
>>>>
>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>
>>> <snip>
>>>
>>> We do see those crashes in similar frequency with Linux 5.12:
>>>
>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>
>>> It seems to be load and/or hardware dependent, since we see it on some devices
>>> quite frequently (every few days), while on others it takes multiple weeks.
>>> Of course the ones where we see it frequently are the ones in production :).
>>>
>>> I am currently trying different stress-ng and other loads to accelerate
>>> the crash rate before trying to git bisect it.
>>
>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>> related to the following dmesg line that reads "failed to reserve memory"
>> below:
>>
>> Machine model: Hardkernel ODROID-N2Plus
>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

That line isn't present in my 5.9 builds, whereas every log I kept from
5.10 builds already shows it, so those builds seem to include the change.

>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>> OF: reserved mem: node linux,cma compatible matching fail
>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>> ...
>>
>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>> and the system appears to operate normally, until eventually the SError
>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>> the "System RAM" memory area, whereas previous kernels had dropped it from
>> "System RAM".
>>
>> The culprit is this new code introduced in Linux 5.12, in this function in
>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

It seems that patch also got backported, which is why I see it with
5.10 as well.

>>
>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>                                          phys_addr_t size, bool nomap)
>> {
>>          if (nomap) {
>>                  /*
>>                   * If the memory is already reserved (by another region), we
>>                   * should not allow it to be marked nomap.
>>                   */
>>                  if (memblock_is_region_reserved(base, size))  <------
>>                          return -EBUSY;                        <------
>>
>>                  return memblock_mark_nomap(base, size);
>>          }
>>          return memblock_reserve(base, size);
>> }
>>
>> "nomap" is true, due to this text being present in the FDT:
>>
>>     reserved-memory {
>>       ranges;
>>       secmon_reserved: secmon@5000000 {
>>         reg = <0x0 0x05000000 0x0 0x300000>;
>>         no-map;
>>       };
>>       ...
>>
>> But memblock_is_region_reserved() is returning true because there is already an
>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>
>> This function is defined as:
>>
>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>> {
>>          return memblock_overlaps_region(&memblock.reserved, base, size);
>> }
>>
>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>> reserved region "0x5000000-0x52fffff", the function returns true.
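
To make sure I follow the logic, here is a small userspace model of that
check. It is only a sketch, not the kernel code: the struct, the plain
array standing in for memblock.reserved and all model_* names are made up
for illustration. Only the half-open range comparison is meant to mirror
what memblock_overlaps_region() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of memblock.reserved. */
struct model_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Two half-open ranges [base, base + size) overlap when each one starts
 * before the other one ends.
 */
bool model_addrs_overlap(uint64_t base1, uint64_t size1,
			 uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

/* True if [base, base + size) overlaps any entry of the table. */
bool model_is_region_reserved(const struct model_region *tbl, size_t n,
			      uint64_t base, uint64_t size)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (model_addrs_overlap(base, size, tbl[i].base, tbl[i].size))
			return true;
	}
	return false;
}
```

With a table holding base 0x05000000 and size 0x300000, a query for exactly
the same base and size reports an overlap, which is the case that ends in
-EBUSY above. A query that starts at 0x05300000 does not overlap.
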
>>
>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>> allow it to mark the region no-map, then the memory area is properly removed
>> from the "System RAM" area and the crashing stops.
>>
>> I've had the system up and running for 15 days now under heavy load without any
>> crashes, using just the following patch as workaround:
>>
>>
>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>> @@ -1157,13 +1157,6 @@
>>                       phys_addr_t size, bool nomap)
>>   {
>>       if (nomap) {
>> -        /*
>> -         * If the memory is already reserved (by another region), we
>> -         * should not allow it to be marked nomap.
>> -         */
>> -        if (memblock_is_region_reserved(base, size))
>> -            return -EBUSY;
>> -
>>           return memblock_mark_nomap(base, size);
>>       }
>>       return memblock_reserve(base, size);
>>
>>
>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>> well.

Even though it is probably not the correct solution, I'll give this a try
on my end just to verify that we are indeed experiencing the same problem
and that the change fixes it for me too.
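
For comparing before and after, something along these lines could check
whether 0x5000000 still falls inside a "System RAM" range. This is a rough
sketch: the function name is hypothetical, and it assumes the usual
"start-end : name" line layout of /proc/iomem, passed in as text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan /proc/iomem-style text and report whether addr lies inside any
 * range named "System RAM". Each line is expected to look like
 * "05300000-07ffffff : System RAM"; nested entries are indented.
 */
bool model_addr_in_system_ram(const char *iomem_text, unsigned long long addr)
{
	const char *line = iomem_text;

	assert(iomem_text != NULL);

	while (*line) {
		unsigned long long start, end;
		char name[64];

		/* The leading space in the format skips any indentation. */
		if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, name) == 3 &&
		    strcmp(name, "System RAM") == 0 &&
		    addr >= start && addr <= end)
			return true;

		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return false;
}
```

The text has to be read as root, since unprivileged reads of /proc/iomem
show the addresses as zeros.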

>>
>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>> that the existing reserved region is identical (same start/end) to the region
>> getting marked no-map.
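
To make that proposal concrete, a rough userspace sketch of the semantics
follows. It is not a patch against drivers/of/fdt.c: the model_* names and
the array standing in for memblock.reserved are assumptions for
illustration. An overlap still refuses no-map, unless the overlapping
entry has exactly the same base and size.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of memblock.reserved. */
struct model_region {
	uint64_t base;
	uint64_t size;
};

/* True if some entry overlaps [base, base + size) without being identical. */
bool model_conflicting_overlap(const struct model_region *tbl, size_t n,
			       uint64_t base, uint64_t size)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		bool overlap = base < tbl[i].base + tbl[i].size &&
			       tbl[i].base < base + size;
		bool identical = tbl[i].base == base && tbl[i].size == size;

		if (overlap && !identical)
			return true;
	}
	return false;
}

/* Returns 0 if no-map may proceed, -EBUSY on a partial overlap. */
int model_reserve_nomap(const struct model_region *tbl, size_t n,
			uint64_t base, uint64_t size)
{
	if (model_conflicting_overlap(tbl, n, base, size))
		return -EBUSY;

	/* The kernel would call memblock_mark_nomap(base, size) here. */
	return 0;
}
```

One caveat I have not checked on this board: memblock merges neighbouring
reserved entries, so an exact base/size match may not survive in the real
table if an adjacent region is reserved as well.
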
> 
> If U-Boot is marking regions with the wrong type/attributes in the EFI
> memory map, then the best thing to do would be to fix that. I see a
> fairly recent commit which looks suspiciously relevant:
> 
> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

That patch appears to be in U-Boot 2021.04, which I am using, so on its
own it does not seem to fix the mapping.

> 
> Booting with "efi=debug" should (among other things) print the memory
> map at boot if you want to double-check that that is the source of the
> mismatch. Our EFI code should be perfectly capable of setting the
> memblock flag if the region *is* described appropriately, see
> reserve_regions() in drivers/firmware/efi/efi-init.c.

Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
[    0.000000] Machine model: Hardkernel ODROID-N2Plus
[    0.000000] efi: Getting UEFI parameters from /chosen in DT:
[    0.000000] efi: UEFI not found.
[    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?

--
Stefan


