Re: [PATCH v1] kernel: add a simple timer based software watchpoint

From: Thomas Gleixner <tglx@kernel.org>
To: Feng Tang <feng.tang@linux.alibaba.com>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	paulmck@kernel.org, Douglas Anderson <dianders@chromium.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	linux-kernel@vger.kernel.org,
	Ard Biesheuvel <ardb+git@google.com>
Subject: Re: [PATCH v1] kernel: add a simple timer based software watchpoint
Date: Fri, 26 Jun 2026 11:16:08 +0200
Message-ID: <87jyrlk6g7.ffs@fw13>
In-Reply-To: <aj3cRtxNmWYW48Wf@U-2FWC9VHC-2323.local>

On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess not being able to watchpoint a
>> > physical address may be common for HW debuggers (I could be very wrong).
>> 
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) do not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with its virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order for solving this issue was:
> we first used this method to halt the system (with a while (1) dead loop)
> when detecting the memory corruption, then silicon engineers gathered
> hardware traces and root caused it. Before that, we didn't know it was a
> BIOS issue, as the initial symptom was random user space "segmentation
> fault" errors.

Sure, the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.

Once you have that figured out and established that it's reproducible
then you add a watchpoint on that address in the kernel which won't
trigger. So that excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.

If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.
  
>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve an issue where a BIOS runtime service
>> > corrupted the ACPI_ENABLE register.
>> 
>> Again, if the BIOS runtime service changes virt/phys translation then you
>> have to trace the BIOS not the kernel. It's pretty obvious, no?
>  
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was that the ACPI_ENABLE register
> had the right value in the BIOS boot messages, and after booting to the OS,
> it was changed to a strange value. So initially the suspects were us OS
> guys :). We used the "approaching" policy of the method: we checked the
> kernel logs (we added many debug ones) from before the corruption was
> detected, found that right before the corruption there was a record of an
> RTC runtime service call, and asked a BIOS engineer to check, who root
> caused it.
>
> So the idea was to find the activities before the "corruption" happened,
> and check whether there were any clues.

Again. You failed to structure the problem and use the tools correctly.

>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and think about whether any of them could be captured by this method.
>> > One example was a static global array overflow issue, which corrupted
>> > some other global variables which were next to it in the kernel bss segment.
>> 
>> No. This is just all catching the problem after the fact with no trace
>> and conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found it was a global variable (in the bss segment) being corrupted (which
> shouldn't happen logically). As it didn't happen on normal platforms, only on
> one platform with a special config, we thought it could be silicon related and
> sent it to the silicon team, who root caused it by gathering and analyzing
> silicon traces: it was an array overflow, as the special config makes that
> array much longer.

Your debug war stories are amazing, but in the wrong way, and they do not
justify shoving a completely ill-defined, barely usable hack into the
kernel to be maintained forever.

Thanks,

        tglx
