From: Thomas Gleixner
To: Feng Tang
Cc: "David Hildenbrand (Arm)", Andrew Morton, Petr Mladek,
 Steven Rostedt, paulmck@kernel.org, Douglas Anderson, Peter Zijlstra,
 Vlastimil Babka, linux-kernel@vger.kernel.org, Ard Biesheuvel
Subject: Re: [PATCH v1] kernel: add a simple timer based software watchpoint
References: <20260622081430.37557-1-feng.tang@linux.alibaba.com>
 <0c39c459-306f-49f5-b08e-e7b9b27b6352@kernel.org>
 <87a4skl36t.ffs@fw13> <87pl1ejoj4.ffs@fw13>
Date: Fri, 26 Jun 2026 11:16:08 +0200
Message-ID: <87jyrlk6g7.ffs@fw13>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess, not able to watchpoint a physical
>> > address may be common for HW debuggers (I could be very wrong).
>>
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) does not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with its virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order for solving this issue was,
> we first used this method to halt (while (1) dead loop) the system when
> detecting the memory corruption, silicon engineers gathered hardware
> traces, then root caused it.
> Before that, we didn't know it's a BIOS issue,
> as the initial symptom was random user space "segmentation fault"

Sure, the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.

Once you have that figured out and established that it's reproducible,
then you add a watchpoint on that address in the kernel which won't
trigger. So that excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.

If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.

>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve one issue that BIOS runtime service
>> > corrupting ACPI_ENABLE register issue.
>>
>> Again, if the BIOS runtime service changes virt/phys translation then you
>> have to trace the BIOS not the kernel. It's pretty obvious, no?
>
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was, the ACPI_ENABLE register
> has right value from BIOS boot message, and after booting to OS, it was
> changed to a strange value. So initially the suspect was us OS guy :).
> And we used the "approaching" policy of the method, checked the kernel
> logs (we added many debug ones) before the corruption was detected, and
> found right before the corruption, there was a RTC runtime service
> calling record, and asked BIOS engineer to check, which root caused it.
>
> So the idea was to find the activities before the happening of "corruption",
> and check if there was some clues.

Again. You failed to structure the problem and use the tools correctly.
>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and think about if there is some that could be captured by this method,
>> > one example was a static global array overflow issue, which corrupted
>> > some other global variables which was next to it in kernel bss segment.
>>
>> No. This is just all catching the problem after the fact with no trace
>> and conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found it's a global variable (in bss segment) being corrupted (which shouldn't
> happen logically). As it didn't happen on normal platforms, but one platform
> with special config, we think it could be silicon related, and sent it to
> silicon team, who did root cause it with gathering/analyzing silicon traces to
> be an array overflow issue, as the special config make that array much longer.

Your debug war stories are amazing, but in the wrong way, and they do
not justify shoving a completely ill-defined, barely usable hack into
the kernel to be maintained forever.

Thanks,

        tglx