Date: Wed, 24 Jun 2026 03:39:38 -0700
From: Breno Leitao
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, david@kernel.org, lance.yang@linux.dev, akpm@linux-foundation.org, baoquan.he@linux.dev, rppt@kernel.org, pratyush@kernel.org
Cc: kexec@lists.infradead.org, linux-mm@kvack.org, rneu@meta.com, riel@surriel.com, caggio@meta.com, kas@kernel.org
Subject: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]

TL;DR: carry the hardware-poisoned page list across kexec using KHO, so
the next kernel doesn't hand out known-bad RAM.

The problem
===========

When a page is hard-offlined due to an uncorrectable memory error
(multi-bit ECC), memory_failure() sets PG_hwpoison, unmaps it, removes it
from the buddy allocator and accounts it in num_poisoned_pages /
HardwareCorrupted.

All of that lives in the running kernel's data structures, not in the
hardware. A kexec replaces the kernel image but not the physical DRAM.
The next kernel rebuilds mem_map[] from the firmware-provided memory map,
sees the bad frame as ordinary system RAM, and its buddy allocator hands
the frame back out. The known-bad cell is silently back in circulation,
and the next access faults again - potentially in a context that is harder
to recover from than the original (e.g. a kernel allocation rather than a
killable user page).

This matters most where kexec is frequent and machines are long-lived:
live kernel update on large fleets. Poison knowledge accumulates over
uptime and is thrown away on every update. This is the case at Meta and
at many hyperscalers.

Possible solutions
==================

1. Do nothing (status quo). The next kernel hands out known-bad RAM, and
   we hope for the best.

2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the frame
   would simply never become RAM (no allocator race at all). But it is
   x86-only: arm64 has no equivalent in the same mechanism, and this
   series is tested on arm64.

3. Firmware / platform page retirement (PPR, BMC page-offline, CXL device
   poison lists). This is the correct layer for *cross power cycle*
   persistence and is complementary to this work. But it is per-platform,
   out of OS control, not universally available, and cannot carry
   OS-discovered or software-simulated poison (MADV_HWPOISON, the
   injector). kexec can also happen long before firmware retirement takes
   effect.

4. reserve_mem= / memmap= on the command line. The reserve_mem= entries
   would be added automatically to the command line of the next kexec
   kernel.

5. A bespoke kexec segment / setup_data blob. This reinvents what KHO
   already provides - preserved memory plus an FDT handoff to the next
   kernel - which is the upstream-blessed generic mechanism for exactly
   this kind of state.
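
To make option 4 concrete, here is a minimal userspace sketch (Python,
not kernel code) of what generating such command-line entries could look
like. It assumes the x86-style memmap=nn$ss syntax, which marks a range as
reserved, and 4 KiB pages; whether this is the exact form the option has in
mind is an assumption, and the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def coalesce(pfns):
    """Merge PFNs into sorted (start_pfn, page_count) runs of adjacent frames."""
    runs = []
    for pfn in sorted(set(pfns)):
        if runs and runs[-1][0] + runs[-1][1] == pfn:
            runs[-1][1] += 1          # extends the previous run by one page
        else:
            runs.append([pfn, 1])     # starts a new run
    return [(start, count) for start, count in runs]


def memmap_args(pfns):
    """Format each run as memmap=<size>$<start>, in bytes, hex-encoded."""
    return [
        f"memmap={count * PAGE_SIZE:#x}${start * PAGE_SIZE:#x}"
        for start, count in coalesce(pfns)
    ]
```

Every non-adjacent poisoned frame costs one more command-line entry, so
the command line grows with the number of scattered poisoned pages.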

This PoC
========

* Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
  HandOver) to carry the poison list between kernels.

* Producer: hooks num_poisoned_pages_inc()/_sub() - the single chokepoint
  for every poison/unpoison event - and records each poisoned PFN into a
  vmalloc array that KHO preserves across the kexec, described by a small
  versioned "hwpoison" subtree.

* Consumer: early in the next boot (fs_initcall_sync, before the buddy
  allocator has handed anything out) it restores that array and re-runs
  memory_failure() on each PFN, re-offlining the frame and rebuilding the
  full hwpoison state (PG_hwpoison, counters, HardwareCorrupted).

* The replay feeds back through the producer, so the list re-publishes
  itself and survives an arbitrary chain of kexecs.

Open questions
==============

* Is there any alternative I am not seeing?

* Is a dedicated "hwpoison" subtree the right granularity, or should this
  live under a broader RAS/KHO umbrella?

* Trusting the inherited list: should the next kernel bound the count /
  validate PFNs against its own memory map before replaying?

Limitations
===========

* Poison events before KHO init (fs_initcall) cannot be published; this is
  academic in practice, as MCEs do not fire that early.

* Per-page only. Cross-power-cycle retirement of a whole DIMM is not
  covered.

I've got a PoC working. It is available here, in case you are interested
in the details I am playing with:

https://github.com/leitao/linux/tree/b4/hwpoison
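
The producer/consumer flow described under "This PoC" can be modeled in
userspace as follows. This is a toy Python model, not code from the PoC:
ToyKernel, kexec() and max_entries are hypothetical names, and the count
bound and PFN validation are only the ideas raised under "Open questions",
not something the PoC is known to implement.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKernel:
    """Userspace stand-in for one booted kernel (hypothetical, not kernel code)."""
    valid_pfns: range                            # PFNs covered by this kernel's memory map
    poisoned: set = field(default_factory=set)   # stands in for PG_hwpoison + counters
    handoff: list = field(default_factory=list)  # stands in for the KHO-preserved array

    def poison(self, pfn):
        """Stand-in for memory_failure(): offline the frame, then publish it."""
        if pfn in self.poisoned:
            return                    # already offlined; do not record twice
        self.poisoned.add(pfn)
        self.handoff.append(pfn)      # producer side (the _inc() hook)

    def unpoison(self, pfn):
        """Stand-in for unpoisoning: drop the frame from both structures."""
        if pfn in self.poisoned:
            self.poisoned.remove(pfn)
            self.handoff.remove(pfn)  # producer side (the _sub() hook)


def kexec(old, valid_pfns, max_entries=1024):
    """Boot a new ToyKernel and replay the inherited list (consumer side)."""
    new = ToyKernel(valid_pfns)
    for pfn in old.handoff[:max_entries]:   # assumed bound on the inherited count
        if pfn in new.valid_pfns:           # assumed validation against the new map
            new.poison(pfn)                 # replay re-publishes via the producer
    return new
```

Because the replay goes through poison(), each new ToyKernel ends up with
its own populated handoff list, which is what lets the list survive a
chain of kexecs in this model.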