From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 60CB7135A53 for ; Sat, 11 Oct 2025 12:58:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1760187490; cv=none; b=H//EOE4SF8RWY1oOu929hc+0DW8g10patlpMdWaJPxrr+1Q5rjJrPTLbbIyCHJKYM1RgRAy692IexJknh1U0suhvYFEFZJtK569/hoclMiQhN22vqeBcuYBEhps0YmN4BNvDNIXrWxCVNi0R6gRpXt7dwlbeMgSIUM3qcXLKz5A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1760187490; c=relaxed/simple; bh=89rRaiWPsApdm1cw1YJX21z9mIPg1mtxgIKLOJ//pho=; h=Message-ID:Date:MIME-Version:Subject:From:To:Cc:References: In-Reply-To:Content-Type; b=ZyeP4ByCFKK1Jg05kaaaaiTG3qswnwV6yKhDQJHPjBAjovbRUkh36ejuYafyMoJKubYng6XQMrwuBk6NXb1kyMmWrFnrwFiotRS5s4dedXfQxX+zOraoFDeUeQSRiDLe80aUB1GmiC6bRXfPTa+/Nt4Bz/MyP0IKH8eo5+9GUXc= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=awgSJglI; arc=none smtp.client-ip=91.218.175.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="awgSJglI" Message-ID: <3954ac60-e818-42a0-b114-c2a09d34572b@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1760187481; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=LI0ebVFo/IUV6FiZoBk7dM1G5eL/sv9Wz3HHgolsQNA=; b=awgSJglISvkYKaa2u26ASTX6O+M9NeM5gMIRbMNxNe7W8d0sIawfQtlDrulh+uSezLqxf6 3z3u8QfOecJBC9sm2cdNp6tubyzDINOzFT9g1KltE5C43Lx2XY+ucbxL8omDwCM9deCX61 i+jYigqjvm8R1wYe/gx02vm5XwIJJ5c= Date: Sat, 11 Oct 2025 20:57:54 +0800 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures Content-Language: en-US X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Lance Yang To: Miaohe Lin , qiuxu.zhuo@intel.com Cc: Longlong Xia , nao.horiguchi@gmail.com, akpm@linux-foundation.org, wangkefeng.wang@huawei.com, xu.xin16@zte.com.cn, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Longlong Xia , david@redhat.com References: <20251009070045.2011920-1-xialonglong2025@163.com> <20251009070045.2011920-2-xialonglong2025@163.com> <55370eb6-9798-0f46-2301-d5f66528411b@huawei.com> <077882e3-f69f-44f3-aa74-b325721beb42@linux.dev> In-Reply-To: <077882e3-f69f-44f3-aa74-b325721beb42@linux.dev> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT Cc Qiuxu On 2025/10/11 17:38, Lance Yang wrote: > > > On 2025/10/11 17:23, Miaohe Lin wrote: >> On 2025/10/11 15:52, Lance Yang wrote: >>> @Miaohe >>> >>> I'd like to raise a concern about a potential hardware failure :) >> >> Thanks for your thought. >> >>> >>> My tests show that if the shared zeropage (or huge zeropage) gets marked >>> with HWpoison, the kernel continues to install it for new mappings. >>> Surprisingly, it does not kill the accessing process ... >> >> Have you investigated the cause? If user space writes to shared zeropage, >> it will trigger COW and a new page will be installed. After that, reading >> the newly allocated page won't trigger memory error. In this scene, it >> does >> not kill the accessing process. > > Not write just read :) > >> >>> >>> The concern is, once the page is no longer zero-filled due to the >>> hardware >>> failure, what will happen? Would this lead to silent data corruption for >>> applications that expect to read zeros? >> >> IMHO, once the page is no longer zero-filled due to the hardware >> failure, later >> any read will trigger memory error and memory_failure should handle that. > > I've only tested injecting an error on the shared zeropage using > corrupt-pfn: > > echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn > > But no memory error was triggered on a subsequent read ... > > Anyway, I'm trying to explore other ways to simulate hardware failure :) > > Thanks, > Lance >