From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BF4F832720C for ; Wed, 10 Jun 2026 16:50:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=216.40.44.16 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781110219; cv=none; b=pBlL2Q2TFwBs8Utl6EUKEhiNB5eHgsjUJhDIbgsvm/rtK6g2vF1n8cK/fEVcscnDnT1/3w2g9Wzdm0KtQG/r+CohHBVQwsSW2W5bp4wjpih+f3e5ghZlzpb3dARgvF+AMBUdI5CcxOdre52asoXIwAibKaA05JBtd4uissh1fVE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781110219; c=relaxed/simple; bh=PCg9xpEmjV9lEMt2bsd0KGRtJd/nZxptykT5/OuhUkc=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=jGvjRJto2s3nUsXDaKWtwx9uLmui8L+uPWSQbYgdHomnppCvH7GfSuTzg6B1iw4kpMHbVwd5dLTk4MxvQXAIXRGwnnZBQTzA34w5qecSlVsi+kWdG2st4J2yc7vwkV8b1Bn3xIx56NswJbyuKph2db8fpgfn7WmLzjBP6z4SQ/M= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=goodmis.org; spf=pass smtp.mailfrom=goodmis.org; arc=none smtp.client-ip=216.40.44.16 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=goodmis.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=goodmis.org Received: from omf20.hostedemail.com (lb01a-stub [10.200.18.249]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 6D292C10E0; Wed, 10 Jun 2026 16:50:16 +0000 (UTC) Received: from [HIDDEN] (Authenticated sender: rostedt@goodmis.org) by omf20.hostedemail.com (Postfix) with ESMTPA id B4BA92002A; Wed, 10 Jun 2026 16:50:14 +0000 (UTC) Date: Wed, 10 Jun 2026 12:50:13 -0400 From: Steven Rostedt To: Feng Tang Cc: Andrew Morton , Petr Mladek , paulmck@kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] lib/sys_info: add a simple timer based memory corruption detector Message-ID: <20260610125013.4cd775e3@robin> In-Reply-To: <20260527034324.51136-1-feng.tang@linux.alibaba.com> References: <20260527034324.51136-1-feng.tang@linux.alibaba.com> X-Mailer: Claws Mail 4.4.0 (GTK 3.24.52; x86_64-redhat-linux-gnu) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Stat-Signature: ogh8put3m3jwdpex63t5w8ctrzitqmhj X-Rspamd-Server: rspamout08 X-Rspamd-Queue-Id: B4BA92002A X-Session-Marker: 726F737465647440676F6F646D69732E6F7267 X-Session-ID: U2FsdGVkX1/1pOfB2clljN8t658vRqj2pC29nMq/WAE= X-HE-Tag: 1781110214-659516 X-HE-Meta: U2FsdGVkX182zPoSmQy8qm4x90poCYvEDJtXtoZ9BYazyQc+qfXQg4HMv7jkwdjIpXnE8cf0sQkSpG4hOIjzmm0H0feVZlfe9SSbIrq/BAdogDn8tZHnfGfDLs0UC9O414f/Ajf9vSkkWWr912zFMn2IqSOs97KUDVpBC9XKVQmKDQSNQE9ngPIkVtln8OcArjsLf8NCSBDd+8vRFohK9s3R6mywk8mBrKgYenh0KAxUDMAWtVy05yf0c3GwMWnTCmfZbtoCA+vIngjtDxXBTO6QEF5yWqzjxUkTirryEgYbrY6J0BWDj3rJjOb+R7JxIuR9JX9W7Qz43GWSKic7oY5AFUg0ICwpw596tjgTKMkCZlFaG3woJQLGTOlbnJiYSt7uuHzDvRZMabHP6zmzYA6j8QGk1HnNxp+XSa6ljXw= On Wed, 27 May 2026 11:43:24 +0800 Feng Tang wrote: > During debugging some bios/hardware related nasty memory corruption > issues, we found using periodic timer to monitor specific dram/mmio > physical address is very useful for debugging, which acts like > a basic software watchpoint. > > For those bugs, who (and when) change(corrupt) those dram or mmio > register is hard to trace, and sometimes even hardware jtag debugger > can't help (say the physical address watchpoint doesn't work). > > The biggest shortcoming is it can never capture the exact point like > a hardware watchpoint, no matter how small the timer interval is set, > the idea is trying to approach the point, hoping the caught context > have enough debug info (which did help us in solving bios/hardware > bugs) Instead of using a timer, can't you use the function tracer? That is, have the value checked at *every* function call. You can easily add a custom callback that gets called when every function is executed. See https://docs.kernel.org/trace/ftrace-uses.html > > The working flow is simple: after suspected address is identified, > start periodic timer polling it to catch if its value is changed to > target 'magic' value, then halt the cpu (better limit to have only > one cpu online), or panic, or print out system information, so that > the error environment is frozen for further check , or let > kexec/kdump to record the vmore, etc. > > All the settings are module parameters: > > watch_interval_ms: SW watchpoint check interval in ms > paddr_dram_to_watch: Physical dram address to monitor. > target_dram_val: Expected value at the dram address that triggers the watchpoint. > paddr_mmio_to_watch: Physical mmio address to monitor. Must be 32-bit aligned. > target_mmio_val: Expected value at the mmio address that triggers the watchpoint. > panic_on_hit: Trigger kernel panic when watchpoint condition hits. > hang_on_hit: halt the CPU (wait for HW debugger) > > This RFC is trying to show the idea and get feedback, and there are > some todos: > * merge the dram/mmio interface to auto detect it's dram or mmio > * support runtime changing the address > * move the starting point earlier in boot phase > * currently is monitoring 'changing to a value', add support > for 'changing from a value' > > Signed-off-by: Feng Tang > +static void sw_watchpoint_timer_fn(struct timer_list *unused) > +{ > + bool hit = false; > + > + if (vaddr_mmio && (*vaddr_mmio == target_mmio_val)) { > + pr_info("MMIO [@0x%lx] hit the target value [0x%x]!\n", > + paddr_mmio_to_watch, target_mmio_val); > + hit = true; > + } > + > + if (vaddr_dram && (*vaddr_dram == target_dram_val)) { > + pr_info("DRAM [@0x%lx] hit the target value [0x%lx]!\n", > + paddr_dram_to_watch, target_dram_val); > + hit = true; > + } > + > + if (hit) { > + sys_info(0); > + > + /* Useful for attaching HW debugger */ > + if (hang_on_hit) { > + pr_warn("Will dead loop on this CPU\n"); > + while (1); > + } > + > + /* Could be used to trigger kexec/kdump */ > + if (panic_on_hit) > + panic("SW watchpoint hit!"); > + > + if (check_once) > + return; > + } The above function would be: static bool no_check; static void sw_watchpoint_fn(unsigned long ip, unsigned long pip, struct ftrace_ops *op, struct ftrace_regs *fregs) { bool hit = false; int bit; if (no_check) return; bit = ftrace_test_recursion_trylock(ip, parent_ip); if (bit < 0) return; [ do the above code ] if (check_once) { // possibly call irq_work to unregister ftrace no_check = true; } ftrace_test_recursion_unlock(bit); } static struct ftrace_ops sw_watchpoint_fops = { .func = sw_watchpoint_fn; }; [..] > + > + mod_timer(&sw_watchpoint_timer, jiffies + msecs_to_jiffies(watch_interval_ms)); > +} > + > +static int __init sw_watchpoint_timer_init(void) > +{ > + if (paddr_mmio_to_watch) { > + vaddr_mmio = ioremap(paddr_mmio_to_watch & PAGE_MASK, PAGE_SIZE); > + if (!vaddr_mmio) > + return -ENOMEM; > + > + vaddr_mmio += (paddr_mmio_to_watch % PAGE_SIZE) / 4; > + } > + > + if (paddr_dram_to_watch) { > + vaddr_dram = phys_to_virt(paddr_dram_to_watch); > + if (!vaddr_dram) > + return -ENOMEM; > + } > + > + timer_setup(&sw_watchpoint_timer, sw_watchpoint_timer_fn, 0); > + sw_watchpoint_timer.expires = jiffies + msecs_to_jiffies(watch_interval_ms); > + add_timer(&sw_watchpoint_timer); Instead of the above, have: register_ftrace_function(&sw_watchpoint_fops); -- Steve > + > + return 0; > +} > +core_initcall(sw_watchpoint_timer_init); > +#endif > > base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d