Re: Orange PI 5 MAX: very unstable using kernel 6.19.0 and 6.18.10, 6.18.9 perfectly stable

From: David Arendt <admin@prnet.org>
To: Qu Wenruo <wqu@suse.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-rockchip@lists.infradead.org,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Orange PI 5 MAX: very unstable using kernel 6.19.0 and 6.18.10, 6.18.9 perfectly stable
Date: Thu, 12 Feb 2026 23:23:44 +0100
Message-ID: <de166323-bb9d-4240-bc42-08ae32067284@prnet.org>
In-Reply-To: <f95f0d27-5bee-4363-b0f0-75e95b2a470d@suse.com>

On 2/12/26 10:05 PM, Qu Wenruo wrote:
>
>
> On 2026/2/13 06:41, David Arendt wrote:
>> Hello,
>>
>> I am using a Kubernetes Cluster with 3 Orange PI5 MAX nodes. The data 
>> is stored using a btrfs filesystem as backend. If using kernel 6.19.0 
>> or kernel 6.18.10 I have experienced many crashes during high IO load 
>> on all 3 nodes. Reverting back to 6.18.9 solves the problems 
>> completely. Unfortunately the crashes are spontaneous reboots without 
>> leaving a trace in any logfile, so I have no stacktrace of them. 
>> After the crashes I have sometimes incorrect btrfs csums for a file 
>> but these may also be a result of a partial write due to the crash. 
>> On one node I had a btrfs error logged without crashing, but I am not 
>> sure if this is the root cause or a result of a prior crash. A scrub 
>> after reboot returned no error with 6.19.0.
>
> The offending tree dump items are:
>
> Feb 10 13:31:07 opi02 kernel:  item 92 key (13218356101120
> Feb 10 13:31:07 opi02 kernel:  item 93 key (13216208642048
> Feb 10 13:31:07 opi02 kernel:  item 94 key (13218356162560
>
> Obviously the key of item 93 is smaller than both the previous and the
> next item keys.
>
> hex(13218356101120) = 0xc05a36b8000
> hex(13216208642048) = 0xc05236be000
> hex(13218356162560) = 0xc05a36c7000
>
> It looks like a bit flipped: "0xc05a3" -> "0xc0523"
>
> 0xa -> 0x2 is exactly one flipped bit (0b1010 -> 0b0010).
>
> So either the memory hardware is faulty, resulting in a stuck bit
> (always 0), or something inside the kernel is touching memory it
> shouldn't.
>
> And this exactly matches the symptom: if random bits in your kernel's
> memory are being changed, crashes are to be expected.
>
>
> Can you run a memtest first to make sure it is not a hardware problem?
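
The single-bit-flip reading of the quoted keys can be double-checked with
a short Python sketch. It assumes that only the first field of each key
(the part visible in the truncated dump lines) matters for the ordering
here, and the helper name single_bit_repairs is made up for illustration.
It tries every single-bit flip of the item 93 value and keeps the ones
that put it back strictly between items 92 and 94:

```python
# First key field of each item, copied from the quoted tree dump.
prev_key = 13218356101120  # item 92
bad_key = 13216208642048   # item 93 (out of order)
next_key = 13218356162560  # item 94


def single_bit_repairs(prev_key, bad_key, next_key, width=64):
    """Return (bit, repaired_key) pairs for which flipping exactly one bit
    of bad_key restores prev_key < repaired_key < next_key.

    Assumption: comparing only this one field is enough; the rest of
    each key is not visible in the quoted dump.
    """
    repairs = []
    for bit in range(width):
        candidate = bad_key ^ (1 << bit)
        if prev_key < candidate < next_key:
            repairs.append((bit, candidate))
    return repairs


repairs = single_bit_repairs(prev_key, bad_key, next_key)
for bit, key in repairs:
    print(f"flip bit {bit}: {hex(bad_key)} -> {hex(key)}")
```

With the three values above, the only flip that restores the ordering is
bit 31, which turns 0xc05236be000 into 0xc05a36be000. That is consistent
with a single bit having been cleared, but it does not by itself say
whether hardware or software cleared it.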

Hello,

I don't know of anything like memtest86 for the arm64 platform that can
test the whole memory, so I used the user-space memtester to check the
14 GB of unused RAM on all 3 machines while running kernel 6.18.10.

Here is the result of the first iteration (same on every machine):

memtester version 4.7.1 (64-bit)
Copyright (C) 2001-2024 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 14000MB (14680064000 bytes)
got  14000MB (14680064000 bytes), trying mlock ...locked.
Loop 1:
   Stuck Address       : ok
   Random Value        : ok
   Compare XOR         : ok
   Compare SUB         : ok
   Compare MUL         : ok
   Compare DIV         : ok
   Compare OR          : ok
   Compare AND         : ok
   Sequential Increment: ok
   Solid Bits          : ok
   Block Sequential    : ok
   Checkerboard        : ok
   Bit Spread          : ok
   Bit Flip            : ok
   Walking Ones        : ok
   Walking Zeroes      : ok

I don't think it is a hardware failure, as it is happening on 3 different
machines. Crashes occur after somewhere between 30 minutes and 12 hours
on all 3 machines. These machines have been running without a single
crash for more than a year with older kernel versions, including 4 days
with 6.18.9 and all versions from 6.18.0 to 6.18.9, so it seems to be
caused by something that changed between 6.18.9 and 6.18.10.

Thanks,

David Arendt

>
> Thanks,
> Qu
>
>
>>
>> Unfortunately I don't have more information at the moment.
>>
>> Thanks in advance,
>>
>> David Arendt
>>
>>
>
