Date: Mon, 8 Jun 2026 10:03:05 +0200
From: Petr Mladek
To: Feng Tang
Cc: Andrew Morton, Steven Rostedt, paulmck@kernel.org,
	linux-kernel@vger.kernel.org, Douglas Anderson, Thomas Gleixner,
	Peter Zijlstra, Vlastimil Babka
Subject: Re: [PATCH] lib/sys_info: add a simple timer based memory corruption detector
References: <20260527034324.51136-1-feng.tang@linux.alibaba.com>
In-Reply-To: <20260527034324.51136-1-feng.tang@linux.alibaba.com>

Added a few more people to Cc.

On Wed 2026-05-27 11:43:24, Feng Tang wrote:
> During debugging some bios/hardware related nasty memory corruption
> issues, we found using periodic timer to monitor specific dram/mmio
> physical address is very useful for debugging, which acts like
> a basic software watchpoint.
>
> For those bugs, who (and when) change(corrupt) those dram or mmio
> register is hard to trace, and sometimes even hardware jtag debugger
> can't help (say the physical address watchpoint doesn't work).

It seems that this approach helped you to debug a nasty problem.
I am not sure why the other ways did not work.

Could you please provide some more information about the use case?
Ideally, please describe one particular situation where it helped:
what the bug was, how it manifested, and how the crash dump helped
to analyze it.
Feel free to use generic names, like graphics card or SSD disk,
instead of exact producer names.

> The biggest shortcoming is it can never capture the exact point like
> a hardware watchpoint, no matter how small the timer interval is set,
> the idea is trying to approach the point, hoping the caught context
> have enough debug info (which did help us in solving bios/hardware
> bugs)
>
> The working flow is simple: after suspected address is identified,
> start periodic timer polling it to catch if its value is changed to
> target 'magic' value, then halt the cpu (better limit to have only
> one cpu online), or panic, or print out system information, so that
> the error environment is frozen for further check, or let
> kexec/kdump to record the vmore, etc.
>
> All the settings are module parameters:
>
>   watch_interval_ms: SW watchpoint check interval in ms
>   paddr_dram_to_watch: Physical dram address to monitor.
>   target_dram_val: Expected value at the dram address that triggers the watchpoint.
>   paddr_mmio_to_watch: Physical mmio address to monitor. Must be 32-bit aligned.
>   target_mmio_val: Expected value at the mmio address that triggers the watchpoint.
>   panic_on_hit: Trigger kernel panic when watchpoint condition hits.
>   hang_on_hit: halt the CPU (wait for HW debugger)
>
> This RFC is trying to show the idea and get feedback, and there are
> some todos:
> * merge the dram/mmio interface to auto detect it's dram or mmio
> * support runtime changing the address
> * move the starting point earlier in boot phase
> * currently is monitoring 'changing to a value', add support
>   for 'changing from a value'

Sashiko AI has pointed out several possible problems, see
https://sashiko.dev/#/patchset/20260527034324.51136-1-feng.tang%40linux.alibaba.com

> --- a/lib/sys_info.c
> +++ b/lib/sys_info.c

If we agreed that this feature would be useful then it would deserve
its own source file. IMHO, it fits into the watchdog category.
I would put it into kernel/watch_mem or so.

Best Regards,
Petr

> @@ -164,3 +164,107 @@ void sys_info(unsigned long si_mask)
>  {
>  	__sys_info(si_mask ? : kernel_si_mask);
>  }
> +
> +#ifdef CONFIG_SW_WATCHPOINT
> +
> +/* default 100 ms interval */
> +static unsigned long watch_interval_ms = 100;
> +module_param(watch_interval_ms, ulong, 0644);
> +MODULE_PARM_DESC(watch_interval_ms, "SW watchpoint check interval in ms");
> +
> +static unsigned long paddr_dram_to_watch;
> +module_param(paddr_dram_to_watch, ulong, 0644);
> +MODULE_PARM_DESC(paddr_dram_to_watch, "Physical DRAM address to watch");
> +
> +static unsigned long *vaddr_dram;
> +
> +static unsigned long target_dram_val;
> +module_param(target_dram_val, ulong, 0644);
> +MODULE_PARM_DESC(target_dram_val, "Target DRAM value to trigger watchpoint");
> +
> +/* The MMIO address should be 32b aligned */
> +static unsigned long paddr_mmio_to_watch;
> +module_param(paddr_mmio_to_watch, ulong, 0644);
> +MODULE_PARM_DESC(paddr_mmio_to_watch, "Physical MMIO address to watch (32bit aligned)");
> +
> +static unsigned int *vaddr_mmio;
> +
> +static unsigned int target_mmio_val;
> +module_param(target_mmio_val, uint, 0644);
> +MODULE_PARM_DESC(target_mmio_val, "Target MMIO value to trigger watchpoint");
> +
> +static bool panic_on_hit;
> +module_param(panic_on_hit, bool, 0644);
> +MODULE_PARM_DESC(panic_on_hit, "Panic when watchpoint hits");
> +
> +static bool hang_on_hit;
> +module_param(hang_on_hit, bool, 0644);
> +MODULE_PARM_DESC(hang_on_hit, "Hang when watchpoint hits");
> +
> +/* Stop the watchpoint timer after first hit */
> +static bool check_once = true;
> +module_param(check_once, bool, 0644);
> +MODULE_PARM_DESC(check_once, "Stop watching after first hit");
> +
> +static struct timer_list sw_watchpoint_timer;
> +
> +static void sw_watchpoint_timer_fn(struct timer_list *unused)
> +{
> +	bool hit = false;
> +
> +	if (vaddr_mmio && (*vaddr_mmio == target_mmio_val)) {
> +		pr_info("MMIO [@0x%lx] hit the target value [0x%x]!\n",
> +			paddr_mmio_to_watch, target_mmio_val);
> +		hit = true;
> +	}
> +
> +	if (vaddr_dram && (*vaddr_dram == target_dram_val)) {
> +		pr_info("DRAM [@0x%lx] hit the target value [0x%lx]!\n",
> +			paddr_dram_to_watch, target_dram_val);
> +		hit = true;
> +	}
> +
> +	if (hit) {
> +		sys_info(0);
> +
> +		/* Useful for attaching HW debugger */
> +		if (hang_on_hit) {
> +			pr_warn("Will dead loop on this CPU\n");
> +			while (1);
> +		}
> +
> +		/* Could be used to trigger kexec/kdump */
> +		if (panic_on_hit)
> +			panic("SW watchpoint hit!");
> +
> +		if (check_once)
> +			return;
> +	}
> +
> +	mod_timer(&sw_watchpoint_timer, jiffies + msecs_to_jiffies(watch_interval_ms));
> +}
> +
> +static int __init sw_watchpoint_timer_init(void)
> +{
> +	if (paddr_mmio_to_watch) {
> +		vaddr_mmio = ioremap(paddr_mmio_to_watch & PAGE_MASK, PAGE_SIZE);
> +		if (!vaddr_mmio)
> +			return -ENOMEM;
> +
> +		vaddr_mmio += (paddr_mmio_to_watch % PAGE_SIZE) / 4;
> +	}
> +
> +	if (paddr_dram_to_watch) {
> +		vaddr_dram = phys_to_virt(paddr_dram_to_watch);
> +		if (!vaddr_dram)
> +			return -ENOMEM;
> +	}
> +
> +	timer_setup(&sw_watchpoint_timer, sw_watchpoint_timer_fn, 0);
> +	sw_watchpoint_timer.expires = jiffies + msecs_to_jiffies(watch_interval_ms);
> +	add_timer(&sw_watchpoint_timer);
> +
> +	return 0;
> +}
> +core_initcall(sw_watchpoint_timer_init);
> +#endif
>
> base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d
> --
> 2.39.5 (Apple Git-154)
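
For readers who want to try the polling logic outside the kernel, here is
a minimal userspace sketch of the decision made in sw_watchpoint_timer_fn()
and of the word-offset arithmetic from sw_watchpoint_timer_init(). It is
not part of the patch: struct sw_watch, sw_watch_poll() and
mmio_word_index() are hypothetical names, the panic/hang branches are left
out, and a plain variable stands in for the ioremap()ed register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the patch's module parameters. */
struct sw_watch {
	const volatile uint32_t *addr;	/* location being polled */
	uint32_t target;		/* 'magic' value that counts as a hit */
	bool check_once;		/* stop polling after the first hit */
	unsigned int hits;		/* number of hits seen so far */
};

/*
 * One polling step, modelled on sw_watchpoint_timer_fn().
 * Returns true when the caller should re-arm its timer.
 */
static bool sw_watch_poll(struct sw_watch *w)
{
	assert(w && w->addr);

	if (*w->addr != w->target)
		return true;		/* no hit, keep polling */

	w->hits++;
	printf("watched word hit the target value [0x%x]\n",
	       (unsigned int)w->target);

	/* Mirrors the check_once early return in the patch. */
	return !w->check_once;
}

/*
 * Index of the watched 32-bit word inside its page, as computed by
 * (paddr_mmio_to_watch % PAGE_SIZE) / 4 in the patch.
 */
static size_t mmio_word_index(uint64_t paddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (size_t)((paddr % page_size) / sizeof(uint32_t));
}
```

The integer division in mmio_word_index() truncates. With 4 KiB pages, an
address ending in 0xa06 therefore selects the same word as one ending in
0xa04, which is presumably why the parameter description asks for a 32-bit
aligned MMIO address.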