From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 08C182F7EF4
	for <linux-kernel@vger.kernel.org>; Fri, 26 Jun 2026 09:16:11 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1782465373; cv=none; b=HNqs5o7ZIPII+fcLFjctF20yLODZQhByn1Uvs5HFCmWtNhgty0cA7jgLCYdHeKpo63GFcaykLaRKEh86gMmjwsRg4M6n/1y94ZBKvZ9B1BiZXIHohH52jGIi8al33Y42YOkneT4rHLyjTjZLPvTKGoHm/XTlxfY+yzEIqiU/BGo=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1782465373; c=relaxed/simple;
	bh=v4SxjslZ8Fj81C7hMt1qkjKnFuFG7fAzU+ZaOG90HOI=;
	h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID:
	 MIME-Version:Content-Type; b=IMKyqRdInB1h2CQ7q1bQdu3i1Kk+CpvNfAxDWkcvMV6PjU0DbFFgllt7wDjS1ur8g9zh0lm/yQ5No3qI3iUTK37HsfYkCBtsXaFE2KTV2Vs5hqWTZfeo8UnXNn0rem0cZCJwWkGorwO5KMOF1KL0agK7vh7nBnJWCQtfMdcadpA=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Tu/J0I2R; arc=none smtp.client-ip=100.103.45.18
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Tu/J0I2R"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id DC9DD1F00A3A;
	Fri, 26 Jun 2026 09:16:10 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org;
	s=k20260515; t=1782465371;
	bh=1yR/Amd/2SV32hDNvVnHlzb9NTcSIzVkjkhn0N8JD7I=;
	h=From:To:Cc:Subject:In-Reply-To:References:Date;
	b=Tu/J0I2R/qrFpOB44VjirQmrQBDMHOYUWAHMa0jGsMKxvuBNRx2CTY+cBzf107Lmf
	 KkRMTaSwxtew0pqW/8QYxJZosnJveLOSfBQUVnhbHDprNpDaqK8fkNiIozQ/zilsXw
	 ycjzIiofcuRNZdX19OVinI9V9uJq9SYcmtYSE+G0YAFA4VxNqP62agPLiaMJ646rn0
	 rgZ/Bje0sTfG38bDfqCjK4iBSShwLKnOfH0gX9q4+BlV4CzTU15EMkLh2KSbLODGvv
	 UgBUVwdyv/hvfMXDZvF05QPr97oLPLPhkLwxWOW6Ylhph4iznlNYvITKlVSvMJ5HH+
	 rS4l2PGJltAqQ==
From: Thomas Gleixner <tglx@kernel.org>
To: Feng Tang <feng.tang@linux.alibaba.com>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>, Andrew Morton
 <akpm@linux-foundation.org>, Petr Mladek <pmladek@suse.com>, Steven
 Rostedt <rostedt@goodmis.org>, paulmck@kernel.org, Douglas Anderson
 <dianders@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Vlastimil
 Babka <vbabka@kernel.org>, linux-kernel@vger.kernel.org, Ard Biesheuvel
 <ardb+git@google.com>
Subject: Re: [PATCH v1] kernel: add a simple timer based software watchpoint
In-Reply-To: <aj3cRtxNmWYW48Wf@U-2FWC9VHC-2323.local>
References: <20260622081430.37557-1-feng.tang@linux.alibaba.com>
 <e59ca845-2134-45c5-ad31-5e4348bbbd5f@kernel.org>
 <ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local>
 <0c39c459-306f-49f5-b08e-e7b9b27b6352@kernel.org>
 <ajpDNxhOS-6l6LdP@U-2FWC9VHC-2323.local> <87a4skl36t.ffs@fw13>
 <aju7j9a2eyJdqQgt@U-2FWC9VHC-2323.local> <87pl1ejoj4.ffs@fw13>
 <aj3cRtxNmWYW48Wf@U-2FWC9VHC-2323.local>
Date: Fri, 26 Jun 2026 11:16:08 +0200
Message-ID: <87jyrlk6g7.ffs@fw13>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess, not able to watchpoint a physical
>> > address may be common for HW debuggers (I could be very wrong).
>> 
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) does not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with its virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order of events was: we first
> used this method to halt the system (a while (1) dead loop) when the
> memory corruption was detected, silicon engineers gathered hardware
> traces, and then they root-caused it. Before that, we didn't know it
> was a BIOS issue, as the initial symptom was a random user space
> "segmentation fault".

Sure, the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.

Once you have that figured out and established that it's reproducible,
you add a watchpoint on that address in the kernel, which won't
trigger. That excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.

If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.
  
>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve an issue where a BIOS runtime
>> > service corrupted the ACPI_ENABLE register.
>> 
>> Again, if the BIOS runtime service changes virt/phys translation then
>> you have to trace the BIOS, not the kernel. It's pretty obvious, no?
>  
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was that the ACPI_ENABLE
> register had the right value in the BIOS boot messages, and after
> booting to the OS it was changed to a strange value. So initially the
> suspects were us OS guys :). We used the "approaching" policy of the
> method, checked the kernel logs (we added many debug ones) from before
> the corruption was detected, and found that right before the corruption
> there was a record of an RTC runtime service call. We asked the BIOS
> engineers to check, and they root-caused it.
>
> So the idea was to find the activities before the "corruption" happened,
> and check whether there were some clues.

Again. You failed to structure the problem and use the tools correctly.

>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and thought about whether some of them could be captured by this method.
>> > One example was a static global array overflow, which corrupted
>> > other global variables next to it in the kernel bss segment.
>> 
>> No. This is all just catching the problem after the fact, with no trace
>> and no conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure, creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found that a global variable (in the bss segment) was being corrupted
> (which shouldn't happen logically). As it didn't happen on normal
> platforms, only on one platform with a special config, we thought it
> could be silicon related and sent it to the silicon team. By gathering
> and analyzing silicon traces they root-caused it as an array overflow,
> as the special config made that array much longer.

Your debug war stories are amazing, but in the wrong way, and do not
justify shoving a completely ill-defined, barely usable hack into the
kernel to be maintained forever.

Thanks,

        tglx