The Linux Kernel Mailing List
From: Thomas Gleixner <tglx@kernel.org>
To: Feng Tang <feng.tang@linux.alibaba.com>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	paulmck@kernel.org, Douglas Anderson <dianders@chromium.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	linux-kernel@vger.kernel.org,
	Ard Biesheuvel <ardb+git@google.com>
Subject: Re: [PATCH v1] kernel: add a simple timer based software watchpoint
Date: Fri, 26 Jun 2026 11:16:08 +0200
Message-ID: <87jyrlk6g7.ffs@fw13>
In-Reply-To: <aj3cRtxNmWYW48Wf@U-2FWC9VHC-2323.local>

On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess, not able to watchpoint a physical
>> > address may be common for HW debuggers (I could be very wrong).
>> 
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) does not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with its virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order of solving this issue was:
> we first used this method to halt the system (a while (1) dead loop) when
> the memory corruption was detected, silicon engineers gathered hardware
> traces, and then root caused it. Before that, we didn't know it was a
> BIOS issue, as the initial symptom was a random user space
> "segmentation fault".

Sure, the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.

Once you have that figured out and established that it's reproducible,
then you add a watchpoint on that address in the kernel, which won't
trigger. So that excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.

If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.
  
>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve an issue where a BIOS runtime service
>> > corrupted the ACPI_ENABLE register.
>> 
>> Again, if the BIOS runtime service changes virt/phys translation, then you
>> have to trace the BIOS, not the kernel. It's pretty obvious, no?
>  
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was that the ACPI_ENABLE register
> had the right value according to the BIOS boot messages, and after booting
> to the OS, it had changed to a strange value. So initially the suspects
> were us OS guys :). We used the "approaching" policy of the method,
> checked the kernel logs (we added many debug ones) from before the
> corruption was detected, and found that right before the corruption there
> was a record of an RTC runtime service call. We asked a BIOS engineer to
> check, which root caused it.
>
> So the idea was to find the activities before the "corruption" happened,
> and check whether there were any clues.

Again. You failed to structure the problem and use the tools correctly.

>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and thought about whether some of them could be captured by this method.
>> > One example was a static global array overflow issue, which corrupted
>> > some other global variables which were next to it in the kernel bss segment.
>> 
>> No. This is just all catching the problem after the fact with no trace
>> and conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found it was a global variable (in the bss segment) being corrupted (which
> shouldn't happen logically). As it didn't happen on normal platforms, but
> only on one platform with a special config, we thought it could be silicon
> related and sent it to the silicon team, who root caused it by gathering
> and analyzing silicon traces: it was an array overflow, as the special
> config makes that array much longer.

Your debug war stories are amazing, but in the wrong way, and do not
justify shoving a completely ill-defined, barely usable hack into the
kernel to be maintained forever.

Thanks,

        tglx


Thread overview: 19+ messages
2026-06-22  8:14 [PATCH v1] kernel: add a simple timer based software watchpoint Feng Tang
2026-06-22  8:42 ` David Hildenbrand (Arm)
2026-06-22 10:53   ` Thomas Gleixner
2026-06-22 12:45   ` Feng Tang
2026-06-22 14:13     ` David Hildenbrand (Arm)
2026-06-23  8:26       ` Feng Tang
2026-06-24  9:04         ` Thomas Gleixner
2026-06-24 10:21           ` David Hildenbrand (Arm)
2026-06-24 11:16             ` Feng Tang
2026-06-24 11:12           ` Feng Tang
2026-06-25 21:30             ` Thomas Gleixner
2026-06-26  1:56               ` Feng Tang
2026-06-26  2:57                 ` Feng Tang
2026-06-26  6:50                 ` Feng Tang
2026-06-26  9:16                 ` Thomas Gleixner [this message]
2026-06-26 14:33                   ` Feng Tang
2026-06-26 15:31                     ` Thomas Gleixner
2026-06-23 17:26 ` Julian Braha
2026-06-24  2:43   ` Feng Tang
