From: Thomas Gleixner <tglx@kernel.org>
To: Feng Tang <feng.tang@linux.alibaba.com>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Petr Mladek <pmladek@suse.com>,
Steven Rostedt <rostedt@goodmis.org>,
paulmck@kernel.org, Douglas Anderson <dianders@chromium.org>,
Peter Zijlstra <peterz@infradead.org>,
Vlastimil Babka <vbabka@kernel.org>,
linux-kernel@vger.kernel.org,
Ard Biesheuvel <ardb+git@google.com>
Subject: Re: [PATCH v1] kernel: add a simple timer based software watchpoint
Date: Fri, 26 Jun 2026 11:16:08 +0200 [thread overview]
Message-ID: <87jyrlk6g7.ffs@fw13> (raw)
In-Reply-To: <aj3cRtxNmWYW48Wf@U-2FWC9VHC-2323.local>
On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess, not able to watchpoint a physical
>> > address may be common for HW debuggers (I could be very wrong).
>>
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) does not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with it's virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order for solving this issue was,
> we first used this method to halt (while (1) dead loop) the system when
> detecting the memory corruption, silicon engineers gathered hardware
> traces, then root caused it. Before that, we didn't know it's a BIOS issue,
> as the initial symptom was random user space "segmentation fault"
Sure the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.
Once you have that figured out and established that it's reproducible
then you add a watchpoint on that address in the kernel which won't
trigger. So that excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.
If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.
>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve one issue that BIOS runtime service
>> > corrupting ACPI_ENABLE register issue.
>>
>> Again, if the BIOS runtime service changes virt/phys translation the you
>> have to trace the BIOS not the kernel. It's pretty obvious, no?
>
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was, the ACPI_ENABLE register
> has right value from BIOS boot message, and after booting to OS, it was
> changed to a strange value. So initially the suspect was us OS guy :).
> And we used the 'approching" policy of the method, checked the kernel
> logs (we added many debug ones) before the corruption was detected, and
> found right before the corruption, there was a RTC runtime service
> calling record, and asked BIOS engineer to check, which root caused it.
>
> So the idea was to find the activites before the happening of "corrution",
> and check if there was some clues.
Again. You failed to structure the problem and use the tools correctly.
>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and think about if there is some that could be captured by this method,
>> > one example was a static global array overflow issue, which corrupted
>> > some other global variables which was next to it in kernel bss segment.
>>
>> No. This is just all catching the problem after the fact with no trace
>> and conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found it's a global variable (in bss segment) being corrupted (which shouldn't
> happen logically). As it didn't happened on normal platforms, but one platform
> with special config, we think it could be silicon related, and sent it to
> silicon team, who did root cause it with gathering/analyzing silicon traces to
> be an array overflow issue, as the special config make that array much longer.
Your debug war stories are amazing, but in the wrong way and do not
justify to shove a completely ill defined barely usable hack into the
kernel to be maintained forever.
Thanks,
tglx
next prev parent reply other threads:[~2026-06-26 9:16 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-22 8:14 [PATCH v1] kernel: add a simple timer based software watchpoint Feng Tang
2026-06-22 8:42 ` David Hildenbrand (Arm)
2026-06-22 10:53 ` Thomas Gleixner
2026-06-22 12:45 ` Feng Tang
2026-06-22 14:13 ` David Hildenbrand (Arm)
2026-06-23 8:26 ` Feng Tang
2026-06-24 9:04 ` Thomas Gleixner
2026-06-24 10:21 ` David Hildenbrand (Arm)
2026-06-24 11:16 ` Feng Tang
2026-06-24 11:12 ` Feng Tang
2026-06-25 21:30 ` Thomas Gleixner
2026-06-26 1:56 ` Feng Tang
2026-06-26 2:57 ` Feng Tang
2026-06-26 6:50 ` Feng Tang
2026-06-26 9:16 ` Thomas Gleixner [this message]
2026-06-26 14:33 ` Feng Tang
2026-06-26 15:31 ` Thomas Gleixner
2026-06-23 17:26 ` Julian Braha
2026-06-24 2:43 ` Feng Tang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87jyrlk6g7.ffs@fw13 \
--to=tglx@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=ardb+git@google.com \
--cc=david@kernel.org \
--cc=dianders@chromium.org \
--cc=feng.tang@linux.alibaba.com \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rostedt@goodmis.org \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox